Text Files - As Data Storage (in Python)
Database Design
- Overview
- Opening text files
- Reading data from text files
- Writing to text files
- Character encoding issues
- Comma-Separated Values (CSV)
- JavaScript Object Notation (JSON)
- The pandas in the room
- HyperText Markup Language (HTML)
- Data Munging
- Conclusions
Overview
Intro
Python programmers use a variety of techniques for manipulating data in plain text files. We will take a look at a few.
Opening text files
Modes of opening
A Python program can open a file in one of three modes:
- read - the program will be able to read data from the file
- write - the program will be able to completely overwrite the file’s data
- append - the program will be able to add new data at the end of the file
It is possible to open a file in a combination of modes, but we will focus on each mode individually for simplicity.
open(…) function
Python’s built-in open() function is the main tool for opening text files.
f = open('amazing_data.txt', 'r') # open the file in read mode
f = open('amazing_data.txt', 'w') # open the file in write mode
# if the file does not yet exist, it will be automatically created
f = open('amazing_data.txt', 'a') # open the file in append mode
# if the file does not yet exist, it will be automatically created
- In each of these examples, the variable, f, will refer to the file that has been opened.
- Further operations can be done on the data in the file through this f variable.
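The three modes can be seen side by side in a small experiment. This is a minimal sketch, assuming a throwaway file named modes_demo.txt (a hypothetical name) created in a temporary directory so that no real data is overwritten.

```python
import os
import tempfile

# hypothetical file in a temporary directory, so no real data is touched
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

f = open(path, 'w')   # write mode: creates the file, or empties it if it exists
f.write('first line\n')
f.close()

f = open(path, 'a')   # append mode: keeps the existing data and adds to the end
f.write('second line\n')
f.close()

f = open(path, 'r')   # read mode: the file must already exist
contents = f.read()
f.close()

print(contents)
```

Opening the same file in write mode a second time would have discarded the first line, which is why append mode is used for the second write.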
Reading data from text files
Basic concept
There are multiple ways to read data from a file in Python. Which way you choose depends upon your needs.
- f.read() - returns the entire contents of the file as a string
- f.readline() - returns the next available single line of text (including the line break) as a string
- f.readlines() - returns a list containing all lines of the file as the values in the list
- for loop - can be used to iterate through the lines of a file
f.read()
Returns the entire contents of the file as a string.
f = open('amazing_data.txt', 'r') # open the file in read mode
all_text = f.read() # returns the entire contents of the file as a string
# now do something fantastic and interesting with the data in all_text
f.readline()
Returns the next available single line of text (including the line break) as a string.
f = open('amazing_data.txt', 'r') # open the file in read mode
first_line = f.readline() # returns the first available line as a string (including line break)
second_line = f.readline() # returns the next available line as a string (including line break)
# ... and so on
# now do something fantastic and interesting with the data in first_line, second_line, etc.
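Often, you want to remove the line break character from the end of each line. The strip() method does this, and also removes any other whitespace at the start or end of the line.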
f = open('amazing_data.txt', 'r') # open the file in read mode
first_line = f.readline() # returns the first available line as a string (including line break)
first_line = first_line.strip()
second_line = f.readline() # returns the next available line as a string (including line break)
second_line = second_line.strip()
# ... and so on
f.readlines()
Returns a list containing all lines of the file as the values in the list.
f = open('amazing_data.txt', 'r') # open the file in read mode
all_lines = f.readlines() # returns a *list* containing all lines of the file as the values in the list.
# now do something fantastic and interesting with the data in all_lines
for loops
It is possible to iterate through the lines of a file using a for loop.
f = open('amazing_data.txt', 'r') # open the file in read mode
for line in f:
    line = line.strip() # remove the line break character
    # now do something fantastic and interesting with the data in line
- The loop will automatically terminate once it reaches the end of the file.
- This is often the easiest way to iterate through each line of a file and do something with it.
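The four approaches can be compared on the same file. This is a minimal sketch, assuming a hypothetical three-line file named reading_demo.txt that the code creates for itself in a temporary directory.

```python
import os
import tempfile

# hypothetical three-line sample file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'reading_demo.txt')
f = open(path, 'w')
f.write('alpha\nbeta\ngamma\n')
f.close()

f = open(path, 'r')
all_text = f.read()           # one string holding the whole file
f.close()

f = open(path, 'r')
first_line = f.readline()     # only the first line, line break included
f.close()

f = open(path, 'r')
all_lines = f.readlines()     # a list with one string per line
f.close()

stripped_lines = []
f = open(path, 'r')
for line in f:                # the loop stops at the end of the file
    stripped_lines.append(line.strip())
f.close()

print(repr(all_text))
print(repr(first_line))
print(all_lines)
print(stripped_lines)
```

The file is reopened before each approach because reading moves the current position forward: once the end is reached, further reads return nothing.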
Writing data to text files
Basic concept
Files can be either written from scratch or appended to, depending on whether the file is opened in write or append mode.
- f = open('amazing_data.txt', 'w') # open the file in write mode
- f = open('amazing_data.txt', 'a') # open the file in append mode
- In either mode, the f.write() function is used to write new text to the file.
- In either case, one must explicitly close the file with f.close() after writing to it.
Writing from scratch
In write mode, a file is created from scratch. If the file already exists, it will be completely overwritten.
f = open('amazing_data.txt', 'w') # open the file in write mode
f.write("'Twas brillig, and the slithy toves\n") # '\n' ends each line
f.write("Did gyre and gimble in the wabe:\n")
f.write("All mimsy were the borogoves,\n")
f.write("And the mome raths outgrabe.\n")
Close the file at the end. Until the file is closed, some of the written text may still be held in memory rather than saved to the file.
f.close() # close the file to finish the job
Appending
In append mode, an existing file is opened and added to. If the file does not yet exist, it will be automatically created.
f = open('amazing_data.txt', 'a') # open the file in append mode
f.write('"Beware the Jabberwock, my son!\n') # '\n' ends each line
f.write('The jaws that bite, the claws that catch!\n')
f.write('Beware the Jubjub bird, and shun\n')
f.write('The frumious Bandersnatch!"\n')
Close the file at the end. Until the file is closed, some of the written text may still be held in memory rather than saved to the file.
f.close() # close the file to finish the job
The file now contains the first two stanzas of Lewis Carroll’s Jabberwocky.
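Note that f.write() writes exactly the text it is given and does not add a line break of its own. This is a minimal sketch of that behavior, assuming a hypothetical file named write_demo.txt in a temporary directory.

```python
import os
import tempfile

# hypothetical file name, in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'write_demo.txt')

f = open(path, 'w')
f.write('no line break here,')
f.write(' so this continues the same line\n')  # '\n' ends the line explicitly
f.write('this starts a new line\n')
f.close()  # close the file to finish the job

f = open(path, 'r')
lines = f.readlines()
f.close()

print(len(lines))
print(lines)
```

Three calls to f.write() produce only two lines here, because only two line breaks were written.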
Encoding issues
Basic concept
Behind the scenes, a plain text file contains only numbers. A character encoding, such as ASCII or one of the Unicode encodings, determines which character each number represents.
- One day, you will open a file or web site that is supposed to show plain text. But you will see garbled characters and become confused.
ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ
- One day you will write a program that is supposed to open and read the contents of a file. But it will crash and output something confusing about encodings.
- There is one explanation: the program you have used to open the file is assuming the wrong encoding.
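Both symptoms can be reproduced on purpose. This is a minimal sketch, assuming Japanese text stored in the Shift JIS encoding and then read back with the wrong encoding. The particular encodings are chosen for illustration only.

```python
text = 'エンコーディング'           # the Japanese word for "encoding"

data = text.encode('shift_jis')    # the bytes as they would be stored in a file

# decoding with the wrong encoding can "succeed" but yield garbled characters
garbled = data.decode('mac_roman')
print(garbled)

# decoding with another wrong encoding can fail outright instead
try:
    data.decode('utf_8')
    crashed = False
except UnicodeDecodeError as error:
    crashed = True
    print('could not decode:', error)

# decoding with the right encoding recovers the original text
recovered = data.decode('shift_jis')
print(recovered)
```

The bytes never change in this experiment. Only the assumption about what they mean changes.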
Determining a file’s encoding
Determining a file’s encoding can be a bit tricky, but there are a few common techniques.
- The UNIX file command can be used to guess the encoding of a given file, although it’s not always accurate, e.g. try file somefile.txt and see what is output.
- Most applications (editors, web browsers, etc) allow you to select the encoding in which to open the file. Try a few until you find one that looks right!
- The Python module, charade, is designed to help detect a file’s character encoding. Running charade.detect(some_string) will return the proposed encoding as well as a confidence level between 0 and 1 on how sure charade is about its verdict.
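The "try a few" approach can also be automated in a crude way. The following is a rough sketch using only the standard library, not a substitute for a real detector such as charade. The function name guess_encoding and its list of candidate encodings are assumptions made for illustration.

```python
def guess_encoding(data, candidates=('utf_8', 'shift_jis', 'latin_1')):
    """Return the first candidate encoding that decodes data without error.

    A crude heuristic: a successful decode does not prove the guess is
    right, and latin_1 accepts any sequence of bytes at all.
    """
    for encoding in candidates:
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None

# bytes as they might be read from a file opened with open(path, 'rb')
sample = '床前明月光'.encode('utf_8')
print(guess_encoding(sample))
```

The order of the candidates matters: the strictest encodings should come first, since permissive ones will accept almost anything.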
Read more about the wonderful world of encodings here.
Dealing with encodings in Python
Python’s default encoding when opening a file depends on the computer it’s running on.
- UTF-8, one of several encodings in the Unicode standard, is often the encoding you want (but not always).
f = open('amazing_data.txt', 'r', encoding='utf_8') # open in read mode with the common UTF-8 Unicode encoding
- See the full set of character encodings supported by Python.
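Stating the encoding both when writing and when reading avoids depending on the machine’s default. This is a minimal sketch, assuming a hypothetical file named poem.txt in a temporary directory.

```python
import locale
import os
import tempfile

# the encoding that open() may fall back on when none is given
print(locale.getpreferredencoding(False))

# hypothetical file name, in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'poem.txt')

f = open(path, 'w', encoding='utf_8')   # state the encoding when writing...
f.write('床前明月光\n')
f.close()

f = open(path, 'r', encoding='utf_8')   # ...and use the same one when reading
line = f.readline().strip()
f.close()

print(line)
```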
A program’s own encoding
Python 3 assumes that a program’s source code is written in UTF-8, while Python 2 assumed ASCII. Either way, you may want to include non-ASCII characters in your code.
- At the top of a program, you can indicate the encoding used by the code itself.
# -*- coding: utf-8 -*-
# define a string with some non-ASCII characters... in this case Li Bai's poem, "Thoughts In Silent Night"
thoughts_in_silent_night = '''
李白《静夜思》
床前明月光,
疑是地上霜。
举头望明月,
低头思故乡。
'''
# print out the poem
print(thoughts_in_silent_night)
Comma-Separated Values (CSV)
Basic concept
Python’s csv module includes many functions useful for manipulating Comma-Separated Values (CSV) data.
Example
Take a CSV file, nonsense.csv, with famous nonsense literature works:
Last name,First name,Title,Year
Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946
Let’s print out the title of each work using the csv module:
import csv
f = open('nonsense.csv', 'r')
csv_reader = csv.DictReader(f)
for line in csv_reader:
    print( line["Title"] )
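One detail worth knowing is that the csv module reads every value as a string, including numbers. This is a minimal sketch of converting a field before using it, with the same data held in memory (via io.StringIO) so that the example does not depend on a file being present.

```python
import csv
import io

# the same data as nonsense.csv, held in memory so the sketch is self-contained
csv_text = '''Last name,First name,Title,Year
Carroll,Lewis,Jabberwocky,1871
Lear,Edward,The Jumblies,1910
Bishop,Elizabeth,The Man-Moth,1946
'''

f = io.StringIO(csv_text)       # behaves like an opened text file
csv_reader = csv.DictReader(f)  # the first row supplies the dictionary keys

works = []
for line in csv_reader:
    line['Year'] = int(line['Year'])   # csv values are always read as strings
    works.append(line)

# titles of the works dated before 1900
early_titles = [work['Title'] for work in works if work['Year'] < 1900]
print(early_titles)
```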
JavaScript Object Notation (JSON)
Basic concept
Python’s json module includes many functions useful for manipulating JavaScript Object Notation (JSON) data.
Example
Take a JSON file, nonsense.json, with famous nonsense literature works:
[
  {
    "Last name": "Carroll",
    "First name": "Lewis",
    "Title": "Jabberwocky",
    "Year": 1871
  },
  {
    "Last name": "Lear",
    "First name": "Edward",
    "Title": "The Jumblies",
    "Year": 1910
  },
  {
    "Last name": "Bishop",
    "First name": "Elizabeth",
    "Title": "The Man-Moth",
    "Year": 1946
  }
]
Let’s print out the title of each work using the json module:
import json
f = open('nonsense.json', 'r')
all_text = f.read() # get all text in file as a string
list_of_works = json.loads(all_text) # return a list of all works
# loop through each literature work
for work in list_of_works:
    print( work["Title"] )
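The conversion works in both directions. This is a minimal sketch using a single entry held in a string, so that the example does not depend on a file being present.

```python
import json

# one entry in the style of nonsense.json, held in a string
json_text = '{"Last name": "Carroll", "First name": "Lewis", "Title": "Jabberwocky", "Year": 1871}'

work = json.loads(json_text)   # JSON object -> Python dictionary
print(type(work['Year']))      # JSON numbers become Python numbers, not strings

round_trip = json.dumps(work, indent=2)   # Python dictionary -> JSON text
print(round_trip)

# json.load(f) and json.dump(data, f) do the same directly with an open file
```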
The pandas in the room
One module to rule them all
pandas is an exceptionally powerful library for data analysis.
- It allows for easy reading of both CSV and JSON data files, as well as most other common formats.
import pandas as pd
df = pd.read_csv('nonsense.csv')
print( df['Title'] ) # output all titles
import pandas as pd
df = pd.read_json('nonsense.json', orient='columns')
print( df['Title'] ) # output all titles
- pandas can do much more than this, but it cannot do everything.
HyperText Markup Language
Warning
While HyperText Markup Language (HTML) is one of the most common sources of loosely-structured and unstructured data today, it is difficult to parse manually due to its idiosyncratic syntax.
- Don’t try to parse HTML manually… it’s too much work.
- Use a readymade module such as Beautiful Soup instead.
Extraction with Beautiful Soup
Take, for example, the following HTML file named nonsense.html:
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Famous Works of Nonsense Literature</title>
  </head>
  <body>
    <section>
      <h1>Famous Works of Nonsense Literature</h1>
      <article>
        <h2>Jabberwocky</h2>
        <p>by Lewis Carroll<br />1871</p>
      </article>
      <article>
        <h2>The Jumblies</h2>
        <p>by Edward Lear<br />1910</p>
      </article>
      <article>
        <h2>The Man-Moth</h2>
        <p>by Elizabeth Bishop<br />1946</p>
      </article>
    </section>
  </body>
</html>
Extraction with Beautiful Soup
The following Python program will extract the titles of each work of nonsense literature.
from bs4 import BeautifulSoup
f = open('nonsense.html', 'r') # open the file in read mode
contents = f.read() # returns the entire contents of the file as a string
soup = BeautifulSoup(contents, 'lxml') # use Beautiful Soup to parse the html
# find all h2 tags and iterate through them
for tag in soup.find_all('h2'):
    # print out the contents of each h2 tag
    print( tag.text )
Data Munging
Definition
Data Wrangling, sometimes referred to as Data Munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.
- Pay special attention to that last sentence!
Common issues
While the meaning and values in data sets vary wildly from one to another, the issues a data analyst encounters when first trying to use the data are usually quite predictable.
- determining the character encoding - embarrassingly not as straightforward as one might expect
- cleaning up data scraped from the web - e.g. unescaping html entities and escape characters, removing html code altogether from the data
- deciding what to do about missing values - e.g. ignore them, invalidate the entire data series, fill them in with averages, etc
- normalizing the data - e.g. standardize capitalization and standardize the text used to indicate the same value, such as ‘NYC’ and ‘New York City’
- validating the data - i.e. does it pass the smell test and seem to have reasonable values and the expected range of values in each field
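Several of these issues can be handled with a few lines each. The following is a rough sketch, not a general-purpose cleaner: the records, the CITY_ALIASES lookup table, the clean() function and the range of plausible years are all hypothetical, made up for illustration.

```python
import html
import re

# hypothetical raw records, as they might look after scraping
raw_records = [
    {'city': 'NYC', 'title': 'The Owl &amp; the Pussycat', 'year': '1871'},
    {'city': 'new york city', 'title': '<b>Jabberwocky</b>', 'year': ''},
    {'city': 'New York City', 'title': 'The Jumblies', 'year': '19100'},
]

# hypothetical lookup table mapping known variants to one standard spelling
CITY_ALIASES = {'nyc': 'New York City', 'new york city': 'New York City'}

def clean(record):
    """Return a cleaned copy of one raw record."""
    title = html.unescape(record['title'])   # unescape entities such as &amp;
    title = re.sub(r'<[^>]+>', '', title)    # crude removal of leftover html tags
    # normalize: compare in lower case, fall back on the original value
    city = CITY_ALIASES.get(record['city'].strip().lower(), record['city'])
    # missing values: an empty year becomes None rather than a made-up number
    year = int(record['year']) if record['year'].strip() else None
    # validate: flag years that are missing or outside a plausible range
    valid = year is not None and 1000 <= year <= 2100
    return {'city': city, 'title': title, 'year': year, 'valid': valid}

cleaned = [clean(record) for record in raw_records]
for record in cleaned:
    print(record)
```

Flagging questionable records instead of silently dropping them leaves the decision about what to do with them to the analyst.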
Conclusions
Thank you. Bye.