knowledge-kitchen

SQL - Intro (in SQLite)

Database Design

  1. Overview
  2. Setup
  3. CRUD
  4. Import data
  5. Simple reads
  6. Functions
  7. Grouping
  8. Combinations
  9. Conclusions

Overview

Features

A small, lightweight, portable relational database:

One user, one data file, one computer

SQLite is designed to be used by a single user with the data stored in a single file stored on a single computer.

Scaling a database

As an application grows in usage and functionality, the chosen database system may no longer be suitable. There are two general strategies for “scaling up” a database system:

Setup

Installation

How to install and run SQLite:

Start

Run from the command shell.

Replace database_name with a database name of your choosing:

sqlite3 database_name.db

Quit

To quit SQLite, use the .quit helper function:

.quit

Now go and re-start again!

CRUD

Overview

Data operations fall into one of four types:

These operations are affectionately known as data CRUD.

Definition

“crud” definition from wordnik.com:

n. A coating or an incrustation of filth or refuse.

n. Something loathsome, despicable, or worthless.

Welcome to the world of data!

Create (tables)

Before any data can be added to the database, a table must be created using the create table command, including

We will write these and other CRUD commands in the SQL language, which is common among relational databases.

Create (tables… continued)

An example of a table to hold student information:

create table students (
  id INTEGER PRIMARY KEY,
  first_name TEXT,
  last_name TEXT,
  email TEXT,
  city TEXT,
  state TEXT,
  country TEXT,
  created DATETIME DEFAULT CURRENT_TIMESTAMP
);

Create rows manually

One can create new records in a table using the insert command

insert into students (first_name, last_name, email, city, state, country) values ("Foo", "Barstein", "fb1258@nyu.edu", "Dallas", "Texas", "United States");

Create rows by importing a data file

One can import records from a file into a table. First set up the environment:

.mode csv
.headers on

Create rows by importing a data file (continued)

Our students table has a created field that does not exist in the CSV data. SQLite will not import from a CSV file into a table whose fields do not exactly match the fields in the CSV data.

So let’s import the CSV into a temporary table that matches the CSV exactly:

create table temp_table (
  first_name TEXT,
  last_name TEXT,
  email TEXT,
  city TEXT,
  state TEXT,
  country TEXT
);

Then import the CSV into this temp_table, informing SQLite to skip the first row with headers in the file:

.import some_directory/students.csv temp_table --skip 1

Create rows by importing a data file (continued again)

Now, let’s copy the data from temp_table into our permanent students table:

INSERT INTO students
(first_name, last_name, email, city, state, country)
SELECT * FROM temp_table;

The CSV data has now been copied, with the automatic addition of the created field in the students table.

Now we can drop the temporary table.

DROP TABLE temp_table;

Read

Read operations are performed with the select command

select first_name from students where city="Grand Rapids";

Update

Update operations are performed with the update command

update students set first_name="Baz" where id=15;

Updates to the overall schema of the table can be done with the alter table command

alter table students add column photo blob;

Delete

Delete can be one of a few types:

delete from students where first_name="Baz" and last_name="Trownson";
drop table students;

Import data

Find raw data

We will work with data on school districts from the 2010 US census This includes each district’s…

We need a table in SQLite that will accommodate this.

Create table

Create a table to accommodate the US Census school district data:

CREATE TABLE sd (
  sd_state TEXT,
  sd_geoid TEXT PRIMARY KEY,
  sd_name TEXT,
  sd_lowestGrade TEXT,
  sd_highestGrade TEXT,
  sd_pop_2010 INTEGER,
  sd_hu_2010 INTEGER,
  sd_aland REAL,
  sd_awater REAL,
  sd_aland_sqmi REAL,
  sd_awater_sqmi REAL,
  sd_intptlat REAL,
  sd_intptlong REAL
);

The primary key, the unique identifier of each record, is set to be the sd_geoid field, which represents unique locations in the US Census’ GEOID standard, and is not to be confused with the word geoid.

View list of tables

To view a list of any tables that exist in this database, use the .tables function.

.tables

View table schema

Viewing the schema of the sd table shows the command used to create it:

.schema sd

Import data

Import from a comma-separated values (CSV) file of US census data on school districts into an SQLite table named sd… replace the path to the .csv file with one that works with your file system:

.mode csv
.import /Users/foobar/some_directory/school_districts.csv sd

Note that SQL cannot import if the fields in the table do not exactly match the fields in the CSV data.

To import in such cases, we must do a workaround:

  1. Create a temporary table whose fields exactly match those of the CSV.
  2. Import from the CSV file into this temporary table.
  3. Copy the data from the temporary table into the permanent table.
  4. Drop the temporary table.

Drop table

Delete the sd table:

drop table sd

Now go create it and import data into it again!

Use a script

Saving the commands to create and import data into an SQL script can save time and effort - you can re-create and re-import data at any time by simply executing the script.

Change the file path to whatever path works on your system:

.read /Users/foobar/some_directory/school_districts_setup.sql

Share a database

If you want to copy a database or share it with someone (or yourself), simply pass around a copy of the database file with the data in it.

And then quit and re-start SQLite using that database file. No need to create tables from scratch or import anything.

Quit

.quit

Restart

sqlite3 sd.db

Being a portable, small footprint, simple file format, cross-platform database system has its advantages.

Dot functions

You have no-doubt noticed our use of some dot functions, such as:

Replace the terms in square brackets with the real values (and delete the square brackets).

Simple reads

All school districts

Show all fields of all records…. there are a lot.

select * from sd;

School districts in New York State

The where clause filters results to only those within New York…. there are a lot.

select * from sd where sd_state = "NY";

Show only school district names

Each records contains many fields. Limit to just the name of each school district, stored in a field named, sd_name:

select sd_name from sd where sd_state = "NY";

Multiple fields can be selected by supplying a comma-separated list, such as sd_name, sd_state.

Sort the results

Sort the records by school district name in reverse alphabetic order:

select sd_name from sd where sd_state = "NY" order by sd_name desc;

Limit the results

Show only the first 5 in the reverse-alphabetic list of New York school districts:

select sd_name from sd where sd_state = "NY" order by sd_name desc limit 5;

Offset the results

Show the second page of results…

select sd_name from sd where sd_state = "NY" order by sd_name desc limit 5 offset 5;

… and the third page:

select sd_name from sd where sd_state = "NY" order by sd_name desc limit 5 offset 10;

Smallest school districts

The 10 smallest school districts in the country:

select sd_name, sd_state, sd_pop_2010 from sd order by sd_pop_2010 asc limit 10;

Functions

Concept

There are a variety of built-in aggregate functions available to be used in queries. These are used to calculate aggregate statistics from the records returned by any query.

Counting records

Calculate how many school districts exist in the US and Puerto Rico using the count() function.

select count(sd_geoid) from sd;

Counting records in a filtered result set

Count the number of school districts in New York State by filtering results with a where clause.

select count(sd_geoid) from sd where sd_state = "NY";

Calculating average

Calculate the average population of school districts across the US, using the avg() function.

select avg(sd_pop_2010) from sd;

Calculating sum

Calculate the total population in the country, using the sum() function.

select sum(sd_pop_2010) from sd;

And, of course, we could calculate the total population of just Minnesota by filtering this query with a WHERE clause:

select sum(sd_pop_2010) from sd where sd_state = "MN";

Grouping

Concept

Grouping records together by some attributes they share in common allows us to calculate aggregate statistics.

Counting school districts per state

Counting the number of school districts per state requires us to group together all school districts in each state:

select count(sd_geoid) from sd group by sd_state;

Counting school districts per state… better

The previous query showed us the numbers, but not which state they represented.

View the state name and the number of school districts in each state:

select sd_state, count(sd_geoid) from sd group by sd_state;

Counting school districts per state in descending order

The state name and the number of school districts in each state, in descending order:

select sd_state, count(sd_geoid) from sd group by sd_state order by count(sd_geoid) desc;

Note that the clause, order by count(sd_geoid) treats the count as if it were like any other field to sort by. We can make this less redundant to write by using an alias for the result of the count:

select sd_state, count(sd_geoid) as num_districts from sd group by sd_state order by num_districts desc;

Calculating the 3 most populous states

Sort states by aggregate population in descending order, then limit to the first 3 records.

select sd_state, sum(sd_pop_2010) as pop from sd group by sd_state order by pop desc limit 3;

Calculating the 3 least populous states

Sort states by aggregate population in ascending order, then limit to the first 3 records.

select sd_state, sum(sd_pop_2010) as pop from sd group by sd_state order by pop asc limit 3;

Combinations

Concept

Let’s take what we’ve learned a step further.

All of New England

New England is made of several states. Let’s find the top ten largest school districts in this combined region using an in operator:

select sd_name, sd_state, sd_pop_2010 from sd where sd_state in ("ME", "VT", "NH", "MA", "RI") ORDER BY sd_pop_2010 desc limit 10;

Highest population density school districts

Which school districts have the highest population density?

select sd_name, sd_state, round(sd_pop_2010 / sd_aland_sqmi) as density from sd order by density desc limit 10;

Lowest population density school districts

Which school districts have the lowest population density?

select sd_name, sd_state, round(sd_pop_2010 / sd_aland_sqmi) as density from sd order by density asc limit 10;

This query will return a bunch of school districts with the name including the text, “School District Not Defined”… let’s exclude these:

select sd_name, sd_state, round(sd_pop_2010 / sd_aland_sqmi) as density from sd where sd_name not like "%not defined%" order by density asc limit 10;

Highest population density states

Which states have the highest population density?

select sd_state, round( sum(sd_pop_2010) / sum(sd_aland_sqmi) ) as density from sd group by sd_state order by density desc limit 10;

Lowest population density states

Which states have the lowest population density?

select sd_state, round( sum(sd_pop_2010) / sum(sd_aland_sqmi) ) as density from sd group by sd_state order by density asc limit 10;

Conclusions

Thank you. Bye.