knowledge-kitchen

Text Files - As Data Storage (in Python)

Database Design

  1. Overview
  2. Opening text files
  3. Reading data from text files
  4. Writing to text files
  5. Character encoding issues
  6. Comma-Separated Values (CSV)
  7. JavaScript Object Notation (JSON)
  8. The pandas in the room
  9. HyperText Markup Language (HTML)
  10. Data Munging
  11. Conclusions

Overview

Intro

Python programmers use a variety of techniques for manipulating data in plain text files. We will take a look at a few.

Opening text files

Modes of opening

A Python program can open a file in one of three modes: read ('r'), write ('w'), or append ('a').

It is possible to open a file in a combination of modes, but we will focus on each mode individually for simplicity.

open(…) function

Python’s built-in open() function is the main tool for opening text files.

f = open('amazing_data.txt', 'r') # open the file in read mode
f = open('amazing_data.txt', 'w') # open the file in write mode
# if the file does not yet exist, it will be automatically created
f = open('amazing_data.txt', 'a') # open the file in append mode
# if the file does not yet exist, it will be automatically created
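
The difference between the three modes can be seen in a small experiment. This is only a sketch: the file name modes_demo.txt and the use of a temporary directory are assumptions made for illustration.

```python
import os
import tempfile

# hypothetical file in a temporary directory, so no real data is touched
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

f = open(path, 'w')  # write mode: creates the file
f.write('first\n')
f.close()

f = open(path, 'w')  # write mode again: the old contents are erased
f.write('second\n')
f.close()

f = open(path, 'a')  # append mode: new text is added to the end
f.write('third\n')
f.close()

f = open(path, 'r')  # read mode: the file must already exist
contents = f.read()
f.close()

print(contents)  # 'first' is gone; 'second' and 'third' remain
```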

Reading data from text files

Basic concept

There are multiple ways to read data from a file in Python. Which way you choose depends upon your needs.

f.read()

Returns the entire contents of the file as a string.

f = open('amazing_data.txt', 'r') # open the file in read mode

all_text = f.read() # returns the entire contents of the file as a string

# now do something fantastic and interesting with the data in all_text

f.readline()

Returns the next available single line of text (including the line break) as a string.

f = open('amazing_data.txt', 'r') # open the file in read mode

first_line = f.readline() # returns the first available line as a string (including line break)
second_line = f.readline() # returns the next available line as a string (including line break)
# ... and so on

# now do something fantastic and interesting with the data in first_line, second_line, etc.

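Often, you want to remove the line break character from the end of each line. The strip() function removes it, along with any other whitespace at the start or end of the string.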

f = open('amazing_data.txt', 'r') # open the file in read mode

first_line = f.readline() # returns the first available line as a string (including line break)
first_line = first_line.strip()

second_line = f.readline() # returns the next available line as a string (including line break)
second_line = second_line.strip()
# ... and so on

f.readlines()

Returns a list containing all lines of the file as the values in the list.

f = open('amazing_data.txt', 'r') # open the file in read mode

all_lines = f.readlines() # returns a *list* containing all lines of the file as the values in the list.

# now do something fantastic and interesting with the data in all_lines

for loops

It is possible to iterate through the lines of a file using a for loop.

f = open('amazing_data.txt', 'r') # open the file in read mode

for line in f:
    line = line.strip() # remove the line break character

    # now do something fantastic and interesting with the data in line

Writing data to text files

Basic concept

Files can be either written from scratch or appended to, depending on whether the file is opened in write or append mode.

Writing from scratch

In write mode, a file is created from scratch. If the file already exists, it will be completely overwritten.

f = open('amazing_data.txt', 'w') # open the file in write mode
# write() does not add line breaks on its own, so each line ends with \n
f.write("'Twas brillig, and the slithy toves\n")
f.write("Did gyre and gimble in the wabe:\n")
f.write("All mimsy were the borogoves,\n")
f.write("And the mome raths outgrabe.\n")

Close the file when you are done writing. Until the file is closed, some of the written text may still be sitting in a memory buffer and not yet saved to disk.

f.close() # close the file to finish the job
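
Forgetting to call close() is an easy mistake to make. One way to avoid it, not covered elsewhere in these notes, is Python's with statement, which closes the file automatically when the indented block ends. A minimal sketch, using a hypothetical file name in a temporary directory:

```python
import os
import tempfile

# hypothetical file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

with open(path, 'w') as f:  # the file stays open only inside this block
    f.write("'Twas brillig, and the slithy toves\n")
    f.write("Did gyre and gimble in the wabe:\n")

# the file was closed automatically when the block ended
print(f.closed)

with open(path, 'r') as f:
    lines = f.readlines()

print(len(lines))  # number of lines that were saved to the file
```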

Appending

In append mode, an existing file is opened and added to. If the file does not yet exist, it will be automatically created.

f = open('amazing_data.txt', 'a') # open the file in append mode
# write() does not add line breaks on its own, so each line ends with \n
f.write('"Beware the Jabberwock, my son!\n')
f.write('The jaws that bite, the claws that catch!\n')
f.write('Beware the Jubjub bird, and shun\n')
f.write('The frumious Bandersnatch!"\n')

As before, close the file when you are done so that all of the appended text is saved to disk.

f.close() # close the file to finish the job

The file now contains the first two stanzas of Lewis Carroll’s Jabberwocky.

Encoding issues

Basic concept

Behind the scenes, plain text files contain only numbers: codes from an encoding system such as ASCII or Unicode, each of which represents a character. Reading those codes with the wrong encoding produces garbled text:

ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ
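
The garbled line above can be reproduced in Python by encoding text with one encoding and decoding it with another. In this sketch, the sample Japanese sentence and the choice of Shift-JIS and Mac Roman as the two encodings are assumptions made for illustration.

```python
original = 'エンコーディングは難しくない'  # "encoding is not difficult"

# turn the text into bytes using the Shift-JIS encoding
raw_bytes = original.encode('shift_jis')

# decoding those bytes with the wrong encoding produces gibberish
garbled = raw_bytes.decode('mac_roman')
print(garbled)

# decoding with the correct encoding recovers the original text
recovered = raw_bytes.decode('shift_jis')
print(recovered == original)
```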

Determining a file’s encoding

Determining a file’s encoding can be a bit tricky, but there are a few common techniques.
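
One simple technique, not necessarily among those the author had in mind, is trial and error: try decoding the file's bytes with a list of candidate encodings and keep the first one that works. The helper name guess_encoding and the candidate list below are hypothetical. A successful decode only shows that the bytes are valid in that encoding, not that the guess is correct.

```python
import os
import tempfile

def guess_encoding(path, candidates=('utf_8', 'shift_jis', 'latin_1')):
    """Return the first candidate encoding that decodes the file without error."""
    with open(path, 'rb') as f:  # 'rb' reads the raw bytes without decoding them
        raw_bytes = f.read()
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, so try the next one
    return None

# create a hypothetical test file saved in Shift-JIS
path = os.path.join(tempfile.mkdtemp(), 'mystery.txt')
with open(path, 'w', encoding='shift_jis') as f:
    f.write('エンコーディング')

print(guess_encoding(path))
```

The order of the candidates matters: latin_1 accepts any sequence of bytes, so it belongs last as a fallback.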

Read more about the wonderful world of encodings here.

Dealing with encodings in Python

Python’s default encoding when opening a file depends on the computer it’s running on, so it is safest to specify the encoding explicitly.

f = open('amazing_data.txt', 'r', encoding='utf_8') # open in read mode with the common UTF-8 Unicode encoding

A program’s own encoding

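Python 3 assumes that source code files are encoded in UTF-8 unless told otherwise (Python 2 assumed ASCII). If you include non-ASCII characters in your code, you can declare the file's encoding explicitly with a special comment at the top.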

# -*- coding: utf-8 -*-

# define a string with some non-ASCII characters... in this case Li Bai's poem, "Thoughts In Silent Night"
thoughts_in_silent_night = '''
李白《静夜思》
床前明月光,
疑是地上霜。
举头望明月,
低头思故乡。
'''

# print out the poem
print(thoughts_in_silent_night)

Comma-Separated Values (CSV)

Basic concept

Python’s csv module includes many functions useful for manipulating Comma-Separated Values (CSV) data.

Example

Take a CSV file, nonsense.csv, with famous nonsense literature works:

Last name,First name,Title,Year
Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946

Let’s print out the title of each work using the csv module:

import csv

f = open('nonsense.csv', 'r')
csv_reader = csv.DictReader(f)

for line in csv_reader:
    print( line["Title"] )
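
The csv module can write data as well as read it. As a sketch, the following uses csv.DictWriter to save two of the works above to a new file and then reads them back. The output file name nonsense_copy.csv and the temporary directory are assumptions made for illustration.

```python
import csv
import os
import tempfile

works = [
    {'Last name': 'Carroll', 'First name': 'Lewis', 'Title': 'Jabberwocky', 'Year': 1871},
    {'Last name': 'Lear', 'First name': 'Edward', 'Title': 'The Jumblies', 'Year': 1910},
]

# hypothetical output file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'nonsense_copy.csv')

# newline='' lets the csv module control the line endings itself
f = open(path, 'w', newline='')
csv_writer = csv.DictWriter(f, fieldnames=['Last name', 'First name', 'Title', 'Year'])
csv_writer.writeheader()     # write the column names as the first line
csv_writer.writerows(works)  # write one line per dictionary
f.close()

f = open(path, 'r', newline='')
rows = list(csv.DictReader(f))
f.close()

print(rows[1]['Title'])
print(type(rows[0]['Year']))  # values read from a CSV file are strings
```

Note that the year was written as a number but comes back as a string; CSV files carry no type information.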

JavaScript Object Notation (JSON)

Basic concept

Python’s json module includes many functions useful for manipulating JavaScript Object Notation (JSON) data.

Example

Take a JSON file, nonsense.json, with famous nonsense literature works:

[
  {
    "Last name": "Carroll",
    "First name": "Lewis",
    "Title": "Jabberwocky",
    "Year": 1871
  },
  {
    "Last name": "Lear",
    "First name": "Edward",
    "Title": "The Jumblies",
    "Year": 1910
  },
  {
    "Last name": "Bishop",
    "First name": "Elizabeth",
    "Title": "The Man-Moth",
    "Year": 1946
  }
]

Let’s print out the title of each work using the json module:

import json

f = open('nonsense.json', 'r')
all_text = f.read() # get all text in file as a string

list_of_works = json.loads(all_text) # return a list of all works

# loop through each literature work
for work in list_of_works:
    print( work["Title"] )
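
Going the other direction, the json module can save Python lists and dictionaries to a file. The sketch below uses json.dump to write and json.load to read back; the file name nonsense_copy.json and the temporary directory are assumptions made for illustration.

```python
import json
import os
import tempfile

list_of_works = [
    {"Last name": "Carroll", "First name": "Lewis", "Title": "Jabberwocky", "Year": 1871},
    {"Last name": "Bishop", "First name": "Elizabeth", "Title": "The Man-Moth", "Year": 1946},
]

# hypothetical output file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'nonsense_copy.json')

f = open(path, 'w')
json.dump(list_of_works, f, indent=2)  # indent=2 makes the file easier for people to read
f.close()

f = open(path, 'r')
loaded_works = json.load(f)  # parse the JSON directly from the open file
f.close()

print(loaded_works == list_of_works)
print(type(loaded_works[0]["Year"]))  # numbers in JSON are loaded as Python numbers
```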

The pandas in the room

One module to rule them all

pandas is an exceptionally powerful library for data analysis.

import pandas as pd

# read the CSV file into a DataFrame
df = pd.read_csv('nonsense.csv')
print( df['Title'] ) # output all titles

# read the JSON file, which holds a list of records, into a DataFrame
df = pd.read_json('nonsense.json', orient='records')
print( df['Title'] ) # output all titles

HyperText Markup Language

Warning

While HyperText Markup Language (HTML) is one of the most common sources of loosely-structured and unstructured data today, it is difficult to parse manually due to its idiosyncratic syntax.

Extraction with Beautiful Soup

Take, for example, the following HTML file named nonsense.html:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Famous Works of Nonsense Literature</title>
  </head>
  <body>
    <section>
      <h1>Famous Works of Nonsense Literature</h1>
      <article>
        <h2>Jabberwocky</h2>
        <p>by Lewis Carroll<br />1871</p>
      </article>
      <article>
        <h2>The Jumblies</h2>
        <p>by Edward Lear<br />1910</p>
      </article>
      <article>
        <h2>The Man-Moth</h2>
        <p>by Elizabeth Bishop<br />1946</p>
      </article>
    </section>
  </body>
</html>

The following Python program will extract the titles of each work of nonsense literature.

from bs4 import BeautifulSoup

f = open('nonsense.html', 'r') # open the file in read mode
contents = f.read() # returns the entire contents of the file as a string
soup = BeautifulSoup(contents, 'lxml') # use Beautiful Soup to parse the html

# find all h2 tags and iterate through them
for tag in soup.find_all('h2'):

    # print out the text inside each h2 tag
    print( tag.text )
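
Beautiful Soup is a third-party package that must be installed separately. Where it is not available, Python's standard library html.parser module can do a similar, if more laborious, extraction. This is a swapped-in technique, not the Beautiful Soup approach; the class name TitleParser and the inline HTML snippet are assumptions made for illustration.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text found inside every h2 tag."""

    def __init__(self):
        super().__init__()
        self.inside_h2 = False  # True while the parser is between <h2> and </h2>
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self.inside_h2 = True

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.inside_h2 = False

    def handle_data(self, data):
        if self.inside_h2:
            self.titles.append(data.strip())

# a shortened, inline stand-in for the contents of an HTML file
contents = '''
<article><h2>Jabberwocky</h2><p>by Lewis Carroll<br />1871</p></article>
<article><h2>The Jumblies</h2><p>by Edward Lear<br />1910</p></article>
'''

parser = TitleParser()
parser.feed(contents)
print(parser.titles)
```

The amount of bookkeeping needed even for this small task is a good argument for using a dedicated library.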

Data Munging

Definition

Data Wrangling, sometimes referred to as Data Munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.

-Wikipedia

Common issues

While the meaning and values in data sets vary wildly from one to another, the issues a data analyst encounters when first trying to use the data are usually quite predictable.
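
The specific issues are not listed here. As an illustration only, the sketch below assumes three typical ones (stray whitespace, missing values, and numbers stored as text) and cleans a few made-up rows. The helper name clean_row and the sample rows are hypothetical.

```python
def clean_row(row):
    """Return a cleaned copy of a row, or None if the row is unusable."""
    title = row.get('Title', '').strip()     # remove stray whitespace
    year_text = row.get('Year', '').strip()
    if title == '' or year_text == '':       # reject rows with missing values
        return None
    try:
        year = int(year_text)                # convert text to a number
    except ValueError:
        return None                          # reject rows with a non-numeric year
    return {'Title': title, 'Year': year}

# made-up rows showing the kinds of problems raw data often contains
raw_rows = [
    {'Title': '  Jabberwocky ', 'Year': '1871'},
    {'Title': 'The Jumblies', 'Year': ''},
    {'Title': 'The Man-Moth', 'Year': ' 1946\n'},
]

clean_rows = []
for row in raw_rows:
    cleaned = clean_row(row)
    if cleaned is not None:
        clean_rows.append(cleaned)

print(clean_rows)
```

Whether to drop, repair, or flag a bad row depends on the analysis; dropping is simply the shortest option to show.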

Conclusions

Thank you. Bye.