Single table design considerations
Common data munging issues
- redundancy (duplicate rows, or other repeated data across records)
- inflexible design (data sets are published for a single purpose, without thought about future use)
- inconsistent spelling/capitalization
- poorly formatted data
- missing data
- text formatting is used to indicate meaning of data (bold, italics, underline, margins, fonts, etc)
- lots of copy and instructions mixed in with the actual data
- character encoding issues
- date/time and timezone incosistency issues
- what else?
Common data content issues
- some data contains codes with special meaning, rather than actual data
- questionable content
Guidelines for creating good tables
- Every row or record contains data about one specific entity.
- Every field contains data with attributes of (or information about) that entity.
- Any given field within a specific record may only contain one value.
- All entries of a specific column or field are of the same kind; i.e. they are of the same data type.
- Every column or field has a unique name within the table.
- The order of the columns or fields is not important.
- The order of the rows or records is not important.
- With respect to the data: No two records may be identical.
Women, Minorities, and Persons with Disabilities in Science and Engineering http://www.nsf.gov/statistics/2015/nsf15311/tables.cfm
All employment, by occupation, race, and ethnicity 2004 - 2013 annual averages. http://www.nsf.gov/statistics/2015/nsf15311/tables/tab9-4.xlsx
This file exhibits some classic munging challenges:
- lots of copy and instructions: there are lots of rows in this file that are descriptive and not part of the data set.
- text formatting issue: the first column in this excel document uses indenting to indicate hierarchy. this formatting will be lost once we covert it to pure text.
- inflexible design: the rows in this data set are not self contained, and therefore cannot be directly imported into a database table without modification.
- e.g. some rows represent categories of jobs (such as "Professionals and related occupations"), while other rows represent jobs themeselves (such as "Engineer")
- e.g. each row represents statistics within a given ethnicity or race, but does not include a field with the data about which ethnicity or race it represents. A solution might be to add a column which includes that information for each row.
- missing data: there are some data points that are missing.
- special codes: some missing numbers in the data set have been replaced by the code "S". these would have to be replaced with perhaps a NULL to make it understandable by database functions.
- questionable content: the standard governmental breakdown of ethnicities and races is not based on genetic research.
Kinds of analyses we may be interested in
This data in this data set might be useful for a variety of analyses of how employment in science- and engineering-related disciplines is broken down by ethnicity and race. Some analyses that might be interesting, include:
- change in overall employment for a given ethnicity over time
- change in specific occupation employment for a given ethnicity over time
- comparative analysis of employment trends between two ethnic/racial groups within a specific job speciality
- what else?
Database design process
To prepare our data so it is ready to be used in a database requires us to go through the following steps:
- Requirements analysis - what is the db used for, what is the data used for, what are the relationships among the data
- Conceptual design - using the requirements, we then model the database
- Logical design - map the conceptual design out into a real database with real tables