SQL - Joins (in SQLite)
Database Design
Overview
Data Set
Raw data
Everyone’s loves employment. Government jobs are rumoured to be stable, non-demanding, and flush with benefits.
Let’s take a look at New York City job openings.
-
As part of the open government movement, New York City publishes data sets exportable in a variety of formats, including SQLite’s favorite format -
CSV. -
Export as CSV and save into a file named
NYC_Jobs.csv.
Data munging
We will explore job postings and salary ranges. As usual, the data set needs a bit of munging before we can import it into a database and do our analysis.
-
Click to download a Python script named
munge.py. -
Place the Python file into the same directory as the
NYC_Jobs.csvfile. -
Run the Python script. It will produce two new files:
-
agencies_data.csv- a list of agencies within NYC government -
jobs_data.csv- information about each job opening in NYC government, including the agency to which it belongs -
We will import each of these CSV files into its own database table
-
Create tables
Start by setting up the two tables within SQLite to match the fields in the CSVs. First for jobs:
DROP TABLE IF EXISTS jobs;
CREATE TABLE jobs (
id INTEGER PRIMARY KEY,
job_id INTEGER NOT NULL,
agency_id INTEGER NOT NULL,
title TEXT NOT NULL,
salary_range_from INTEGER NOT NULL,
salary_range_to INTEGER NOT NULL
);
… and then for agencies:
DROP TABLE IF EXISTS agencies;
CREATE TABLE agencies (
id INTEGER PRIMARY KEY,
title TEXT NOT NULL
);
Import data
Import the data from each CSV file into its corresponding table.
-
First make sure SQLite is set to read CSVs with comma-separators:
.mode csv .separator "," -
import the jobs data into the
jobstable:
.import /Users/foobar/some_directory/jobs_data.csv jobs
- import the agencies data into the
agenciestable:
.import /Users/foobar/some_directory/agencies_data.csv agencies
Check the data
Before going any further, check that the data has imported successfully into SQLite.
Inspect the jobs tables:
.headers on
select * from jobs;
The output should appear something like:
| id | job_id | agency_id | title | salary_range_from | salary_range_to |
|---|---|---|---|---|---|
| 0 | 454437 | 0 | Executive Front Office Executive Operations Analyst | 46856 | 62480 |
| 1 | 386903 | 1 | Public Health Nurse, Bureau of School Health | 40.79 | 40.79 |
| 2 | 424454 | 2 | Desktop Support Technician | 60550 | 69632 |
| 3 | 373602 | 3 | HR PAYROLL ANALYST | 75000 | 115000 |
| … | … | … | … | … | … |
Check the data (continued)
Inspect the agencies tables:
select * from agencies;
The output should appear something like:
| id | title |
|---|---|
| 0 | OFFICE OF MANAGEMENT & BUDGET |
| 1 | DEPT OF HEALTH/MENTAL HYGIENE |
| 2 | DEPARTMENT OF CORRECTION |
| 3 | FINANCIAL INFO SVCS AGENCY |
| 4 | TAXI & LIMOUSINE COMMISSION |
| 5 | LAW DEPARTMENT |
| … | … |
Primary Keys and Foreign Keys
Primary keys
Every table has a primary key - a field (or fields) whose value act as a unique identifier for each record.
-
In the
jobstable, theidfield is always unique for every record and was designated as the primary key when the table was created. -
In the
agenciestable, theidfield is always unique for every record and was designated as the primary key when the table was created. -
Thus, each table has a primary key field.
Primary key terminology
There are a few terms to describe the different types of unique values that can be chosen as primary key:
-
A field in the data that could be used as a primary key is known as a candidate key.
-
An arbitrary value, given to records as primary key solely for the purpose of uniqueness - such as
1,2,3, etc… - is known as a surrogate key. -
A unique value used as primary key that has meaning outside of the database is known as a natural key.
-
A combination of two or more values that together are used as primary key is known as a composite key.
Relationships among multiple tables
In this example, as is often the case, we have multiple tables: jobs and agencies. And there is a relationship between the them.
-
Each job belongs to a particular agency.
-
The
agency_idfield of each job indicates the agency to which each job belongs. -
It does so not by naming the agency, but by giving the unique
idnumber of the relavant agency record. -
Thus, in order to see the full data about a job, we would have to pull the record from the
jobstable as well as the relevant record from theagenciestable. -
Such a query, merging data from two or more tables, is called a join.
Foreign keys
In order to perform a join to merge data from multiple tables, we must somehow inform the database how to match a record from one table with the related record in the other table.
-
In our example, the
jobstable contains theagency_idfield, which holds the uniqueidof the related record in theagenciestable. -
The
idfield of thejobstable is that table’s primary key. -
The
idfield of theagenciestable is that table’s primary key. -
The
agency_idfield of thejobstable is called a foreign key - a reference to the primary key of a record in another table.
Inner Joins
Merging records from two tables
There are several ways to merge - or join - records from one table to related records from another table. An inner join is the most common and perhaps simplest technique.
- Join all records in the
jobstable to their related records in theagenciestable:
select jobs.title, agencies.title from jobs inner join agencies on jobs.agency_id=agencies.id;
-
We have joined the tables by foreign key and primary key… a classic technique.
-
Notice that both tables have their own
idandtitlefields, and so we need to specifically prefix the field names with their table names, e.g.jobs.titleandagencies.title.
Results
The results of the inner join should look like a merged table including the fields we requested from both tables.
| jobs.title | agencies.title |
|---|---|
| Executive Front Office Executive Operations Analyst | OFFICE OF MANAGEMENT & BUDGET |
| Public Health Nurse, Bureau of School Health | DEPT OF HEALTH/MENTAL HYGIENE |
| Desktop Support Technician | DEPARTMENT OF CORRECTION |
| HR PAYROLL ANALYST | FINANCIAL INFO SVCS AGENCY |
| Administrative Assistant to the Assistant Commissioner of Safety & Emissions | TAXI & LIMOUSINE COMMISSION |
| … | … |
Referential integrity
An inner join maintains referential integrity. In practical terms, this means that if a particular job record refers to an agency that does not exist in the agencies table, then that job will not be included in our merged results.
- Try for yourself… let’s first look at the job #1 and its related agency.
select jobs.title, agencies.title from jobs inner join agencies on jobs.agency_id=agencies.id where jobs.id=1;
- Update this job record’s foreign key to refer to a non-existant agency.
update jobs set agency_id=999 where id=1;
- See what happens when you inner join this job to its non-existent agency.
select jobs.title, agencies.title from jobs inner join agencies on jobs.agency_id=agencies.id where jobs.id=1;
Left Joins
Concept
Left joins are similar to inner joins - they merge two related records from two different tables. However, whereas inner joins maintain referential integrity and do not show records with invalid relationships, left joins have no such integrity.
- Try a left join that replicates what we did with an inner join earlier on:
select jobs.id, jobs.title, agencies.title from jobs left join agencies on jobs.agency_id=agencies.id;
-
You will see all the jobs listed, with their corresponding agencies, just like in an inner join.
-
If you look at the job with the
id=1, which we updated to refer to a non-existant agency, you will see it listed - this is different from an inner join.
select jobs.id, jobs.title, agencies.title from jobs left join agencies on jobs.agency_id=agencies.id where jobs.id=1;
-
Thus left joins are not concerned with referential integrity.
-
The table mentioned first in the query,
jobsshows up on the left, with related records from the second table,agencies, attached to the right… thus the term, left join.
Checking for nothingness
This feature of left joins - that they do not maintain referential integrity - allows us to do something we cannot do with inner joins: we can check for non-existent relationships.
- For example, this query shows all jobs for which there is no matching agency:
select jobs.id, jobs.title, agencies.title from jobs left join agencies on jobs.agency_id=agencies.id WHERE agencies.id IS NULL;
Conclusions
Thank you. Bye.