SQL - Intro (in SQLite)

Database Design

Overview
Setup
CRUD
Import data
Simple reads
Functions
Grouping
Combinations
Conclusions

Overview

Features

A small, lightweight, portable relational database:

zero configuration
serverless
simple file format
cross-platform compatible
manifest typing
small footprint
public domain source code
industry-proven

One user, one data file, one computer

SQLite is designed to be used by a single user with the data stored in a single file stored on a single computer.

This allows for a small footprint, ease of setup, and high data portability for small-to-medium-scale applications.
This makes SQLite unsuitable for large-scale multi-user applications with a large load of concurrent reading and writing of data. Other larger-footprint SQL-based relational database systems would be a better choice.

Scaling a database

As an application grows in usage and functionality, the chosen database system may no longer be suitable. There are two general strategies for “scaling up” a database system:

Vertical scaling (i.e. “scaling up”) - increasing the capacity of a single machine, such as by adding more memory, storage, or processing power.
Horizontal scaling (i.e. “scaling out”) - distributing the load across multiple machines, rather than just one. Two general approaches:
- replication - creating many separate duplicate instances of the database across multiple machines
- sharding - distributing the data across multiple machines, where those machines act together as a single database instance

Setup

Installation

How to install and run SQLite:

OS X comes with SQLite installed, although not necessarily the latest version.
To install or upgrade, install homebrew (Mac only) and use it to install sqlite, or download the latest Source code here - pick the “amalgamation” with “configure” script. Unzip the file and follow the instructions in the file named README.txt.

Start

Run from the command shell.

Replace database_name with a database name of your choosing:

sqlite3 database_name.db

If all goes swell, you are now in the SQLite command prompt.
And if a database with this name didn’t previously exist, it has now been created.

Quit

To quit SQLite, use the .quit helper function:

.quit

Now go and re-start again!

CRUD

Overview

Data operations fall into one of four types:

Create - creating new data
Read - viewing existing data
Update - updating existing data
Delete - deleting existing data

These operations are affectionately known as data CRUD.

Definition

“crud” definition from wordnik.com:

n. A coating or an incrustation of filth or refuse.

n. Something loathsome, despicable, or worthless.

Welcome to the world of data!

Create (tables)

Before any data can be added to the database, a table must be created using the create table command, including

the name of the table
a list of fields
the data type assigned to each field, e.g. INTEGER, REAL (floating point), NUMERIC, TEXT, or BLOB (no data type assigned),
an indication of which field is the primary key

We will write these and other CRUD commands in the SQL language, which is common among relational databases.

Create (tables… continued)

An example of a table to hold student information:

create table students (
  id INTEGER PRIMARY KEY,
  first_name TEXT,
  last_name TEXT,
  email TEXT,
  city TEXT,
  state TEXT,
  country TEXT,
  created DATETIME DEFAULT CURRENT_TIMESTAMP
);

The primary key setting specifies which field holds the unique identifier of each row - in this case the id field.
The current date and time will be automatically plugged into the created field with every new record.

Create rows manually

One can create new records in a table using the insert command

insert into students (first_name, last_name, email, city, state, country) values ("Foo", "Barstein", "fb1258@nyu.edu", "Dallas", "Texas", "United States");

Notice that the id value is not specified - it will be automatically supplied by the database if we don’t specify.
Notice also that the created value is not specified - it will be automatically supplied with the current date and time.

Create rows by importing a data file

One can import records from a file into a table. First set up the environment:

.mode csv
.headers on

SQLite natively imports CSV files… just make sure to turn on csv mode first to avoid potential problems.
the .headers on command tells SQLite to display the column headers once you read the data
JSON files are not natively supported and will have to be transformed into a supported format prior to import.
Download a CSV file for this example.

Create rows by importing a data file (continued)

Our students table has a created field that does not exist in the CSV data. SQLite will not import from a CSV file into a table whose fields do not exactly match the fields in the CSV data.

So let’s import the CSV into a temporary table that matches the CSV exactly:

create table temp_table (
  first_name TEXT,
  last_name TEXT,
  email TEXT,
  city TEXT,
  state TEXT,
  country TEXT
);

Then import the CSV into this temp_table, informing SQLite to skip the first row with headers in the file:

.import some_directory/students.csv temp_table --skip 1

Create rows by importing a data file (continued again)

Now, let’s copy the data from temp_table into our permanent students table:

INSERT INTO students
(first_name, last_name, email, city, state, country)
SELECT * FROM temp_table;

The CSV data has now been copied, with the automatic addition of the created field in the students table.

Now we can drop the temporary table.

DROP TABLE temp_table;

Read

Read operations are performed with the select command

select first_name from students where city="Grand Rapids";

The command, .mode csv will make the output look like a CSV file. .mode list will return to the default. .mode markdown will make the output appear as a valid Markdown table.
The command, .separator ";" will set the separator between fields to a semi-colon ; - change it to whatever character you prefer.
These two commands (.mode and .separator) can be useful for transforming the data into a format that is easily imported into another tools or programs.

Update

Update operations are performed with the update command

update students set first_name="Baz" where id=15;

Updates to the overall schema of the table can be done with the alter table command

alter table students add column photo blob;

Delete

Delete can be one of a few types:

Delete a particular record using the delete command

delete from students where first_name="Baz" and last_name="Trownson";

Drop an entire table using the drop table command

drop table students;

Import data

Find raw data

We will work with data on school districts from the 2010 US census This includes each district’s…

name
state
lowest grade taught
highest grade taught
population
number of households
land area
water area
geocoordinates (latitude & longitude)
… and a unique identifier for each row to serve as the primary key

We need a table in SQLite that will accommodate this.

Note: 2010 was the last year that population and housing unit counts where included in these “Gazetteer” data sets.

Create table

Create a table to accommodate the US Census school district data:

CREATE TABLE sd (
  sd_state TEXT,
  sd_geoid TEXT PRIMARY KEY,
  sd_name TEXT,
  sd_lowestGrade TEXT,
  sd_highestGrade TEXT,
  sd_pop_2010 INTEGER,
  sd_hu_2010 INTEGER,
  sd_aland REAL,
  sd_awater REAL,
  sd_aland_sqmi REAL,
  sd_awater_sqmi REAL,
  sd_intptlat REAL,
  sd_intptlong REAL
);

The primary key, the unique identifier of each record, is set to be the sd_geoid field, which represents unique locations in the US Census’ GEOID standard, and is not to be confused with the word geoid.

View list of tables

To view a list of any tables that exist in this database, use the .tables function.

.tables

Try it to make sure the table was created.

View table schema

Viewing the schema of the sd table shows the command used to create it:

.schema sd

Try it to make sure the table was created correctly.

Import data

Import from a comma-separated values (CSV) file of US census data on school districts into an SQLite table named sd… replace the path to the .csv file with one that works with your file system:

.mode csv
.import /Users/foobar/some_directory/school_districts.csv sd

Note that SQL cannot import if the fields in the table do not exactly match the fields in the CSV data.

To import in such cases, we must do a workaround:

Create a temporary table whose fields exactly match those of the CSV.
Import from the CSV file into this temporary table.
Copy the data from the temporary table into the permanent table.
Drop the temporary table.

Drop table

Delete the sd table:

drop table sd

Now go create it and import data into it again!

Use a script

Saving the commands to create and import data into an SQL script can save time and effort - you can re-create and re-import data at any time by simply executing the script.

Change the file path to whatever path works on your system:

.read /Users/foobar/some_directory/school_districts_setup.sql

An SQL script is simply a file with a series of SQL commands that will run when read into the database.

If you want to copy a database or share it with someone (or yourself), simply pass around a copy of the database file with the data in it.

And then quit and re-start SQLite using that database file. No need to create tables from scratch or import anything.

Quit

.quit

Restart

sqlite3 sd.db

Being a portable, small footprint, simple file format, cross-platform database system has its advantages.

Dot functions

You have no-doubt noticed our use of some dot functions, such as:

.quit - quit SQLite
.tables - view a list of tables in the database
.schema [table name] - show the schema of a table
.mode [mode] - change the way data is formatted on screen, e.g csv, list, markdown, and more
.headers [on or off] - whether to show the column headers above the data when displayed
.import [CSV filepath] [table name] - import a data file into a table
.read [SQL script filepath] - run an SQL script file

Replace the terms in square brackets with the real values (and delete the square brackets).

Simple reads

All school districts

Show all fields of all records…. there are a lot.

select * from sd;

The * here means all fields.

School districts in New York State

The where clause filters results to only those within New York…. there are a lot.

select * from sd where sd_state = "NY";

Show only school district names

Each records contains many fields. Limit to just the name of each school district, stored in a field named, sd_name:

select sd_name from sd where sd_state = "NY";

Multiple fields can be selected by supplying a comma-separated list, such as sd_name, sd_state.

Sort the results

Sort the records by school district name in reverse alphabetic order:

select sd_name from sd where sd_state = "NY" order by sd_name desc;

asc or desc can be used to indicate ascending or descending order.

Limit the results

Show only the first 5 in the reverse-alphabetic list of New York school districts:

select sd_name from sd where sd_state = "NY" order by sd_name desc limit 5;

the limit clause controls the number of results that are displayed.

Offset the results

Show the second page of results…

select sd_name from sd where sd_state = "NY" order by sd_name desc limit 5 offset 5;

… and the third page:

select sd_name from sd where sd_state = "NY" order by sd_name desc limit 5 offset 10;

the limit clause controls how many results are displayed
the offset clause indicates which row in the results to start from

Smallest school districts

The 10 smallest school districts in the country:

select sd_name, sd_state, sd_pop_2010 from sd order by sd_pop_2010 asc limit 10;

the order clause indicates by which field to order results
the asc option indicates to order in ascending order
the limit clause controls the number of results displayed

Functions

Concept

There are a variety of built-in aggregate functions available to be used in queries. These are used to calculate aggregate statistics from the records returned by any query.

These are similar to the statistical functions built into spreadsheet software:, SUM(), AVERAGE(), MIN(), MAX(), COUNT(), etc.

Counting records

Calculate how many school districts exist in the US and Puerto Rico using the count() function.

select count(sd_geoid) from sd;

The COUNT() function returns the number of values in the field indicated as an argument.
The primary key field, sd_geoid, is guaranteed to be a unique for each record, while other fields may not be.

Counting records in a filtered result set

Count the number of school districts in New York State by filtering results with a where clause.

select count(sd_geoid) from sd where sd_state = "NY";

We have thus applied a filter - the COUNT() function will only count those rows that match the criteria indicated in the WHERE clause.

Calculating average

Calculate the average population of school districts across the US, using the avg() function.

select avg(sd_pop_2010) from sd;

The AVG() function will return a floating point number.
In actuality, there are no half people or 1/4 people. So we could round the result using the round() function, as in round( avg(sd_pop_2010) ).

Calculating sum

Calculate the total population in the country, using the sum() function.

select sum(sd_pop_2010) from sd;

And, of course, we could calculate the total population of just Minnesota by filtering this query with a WHERE clause:

select sum(sd_pop_2010) from sd where sd_state = "MN";

Grouping

Concept

Grouping records together by some attributes they share in common allows us to calculate aggregate statistics.

Counting school districts per state

Counting the number of school districts per state requires us to group together all school districts in each state:

select count(sd_geoid) from sd group by sd_state;

Thus we don’t count the number of unique geoids in the entire table. This performs a separate count for each group.

Counting school districts per state… better

The previous query showed us the numbers, but not which state they represented.

View the state name and the number of school districts in each state:

select sd_state, count(sd_geoid) from sd group by sd_state;

Counting school districts per state in descending order

The state name and the number of school districts in each state, in descending order:

select sd_state, count(sd_geoid) from sd group by sd_state order by count(sd_geoid) desc;

Note that the clause, order by count(sd_geoid) treats the count as if it were like any other field to sort by. We can make this less redundant to write by using an alias for the result of the count:

select sd_state, count(sd_geoid) as num_districts from sd group by sd_state order by num_districts desc;

Calculating the 3 most populous states

Sort states by aggregate population in descending order, then limit to the first 3 records.

select sd_state, sum(sd_pop_2010) as pop from sd group by sd_state order by pop desc limit 3;

Calculating the 3 least populous states

Sort states by aggregate population in ascending order, then limit to the first 3 records.

select sd_state, sum(sd_pop_2010) as pop from sd group by sd_state order by pop asc limit 3;

Combinations

Concept

Let’s take what we’ve learned a step further.

All of New England

New England is made of several states. Let’s find the top ten largest school districts in this combined region using an in operator:

select sd_name, sd_state, sd_pop_2010 from sd where sd_state in ("ME", "VT", "NH", "MA", "RI") ORDER BY sd_pop_2010 desc limit 10;

Highest population density school districts

Which school districts have the highest population density?

select sd_name, sd_state, round(sd_pop_2010 / sd_aland_sqmi) as density from sd order by density desc limit 10;

Lowest population density school districts

Which school districts have the lowest population density?

select sd_name, sd_state, round(sd_pop_2010 / sd_aland_sqmi) as density from sd order by density asc limit 10;

This query will return a bunch of school districts with the name including the text, “School District Not Defined”… let’s exclude these:

select sd_name, sd_state, round(sd_pop_2010 / sd_aland_sqmi) as density from sd where sd_name not like "%not defined%" order by density asc limit 10;

Highest population density states

Which states have the highest population density?

select sd_state, round( sum(sd_pop_2010) / sum(sd_aland_sqmi) ) as density from sd group by sd_state order by density desc limit 10;

Lowest population density states

Which states have the lowest population density?

select sd_state, round( sum(sd_pop_2010) / sum(sd_aland_sqmi) ) as density from sd group by sd_state order by density asc limit 10;

Conclusions

Thank you. Bye.

SQL - Intro (in SQLite)

Overview

Features

One user, one data file, one computer

Scaling a database

Setup

Installation

Start

Quit

CRUD

Overview

Definition

Create (tables)

Create (tables… continued)

Create rows manually

Create rows by importing a data file

Create rows by importing a data file (continued)

Create rows by importing a data file (continued again)

Read

Update

Delete

Import data

Find raw data

Create table

View list of tables

View table schema

Import data

Drop table

Use a script

Share a database

Dot functions

Simple reads

All school districts

School districts in New York State

Show only school district names

Sort the results

Limit the results

Offset the results

Smallest school districts

Functions

Concept

Counting records

Counting records in a filtered result set

Calculating average

Calculating sum

Grouping

Concept

Counting school districts per state

Counting school districts per state… better

Counting school districts per state in descending order

Calculating the 3 most populous states

Calculating the 3 least populous states

Combinations

Concept

All of New England

Highest population density school districts

Lowest population density school districts

Highest population density states

Lowest population density states

Conclusions