Data munging in Python

Some common data scrubbing tasks:

  • disambiguation of words and phrases
  • unescaping html entities in data scraped from the web
  • removing html code altogether from data scraped from the web
  • cleaning up text formatting

Disambiguation of words and phrases

The same data can be expressed in multiple ways. To prepare data for analysis, you very often want to standardize (disambiguate) it so that the same information is always expressed in the same format.

Example

Apostrophes are common in real-life text. This example shows how to replace a few common contractions with their full-word counterparts, so that the same phrases are written consistently across your data set and analysis of those phrases is easier.

#imagine this is the data you've just scraped off of some social media web site
dirty_data = "You're your own best friend.  But Doug's awsme! #dougisawesome"

#create a dictionary of possibly ambiguous apostrophe contractions and their less ambiguous counterparts
APOSTROPHES = {"'s": " is", "'re": " are"} #there are many more apostrophe uses in both English slang and grammar that you'd have to check for

#chop up the words
words = dirty_data.split()

#create a blank list to hold modified words
new_words = []

#loop through each word
for word in words:
    for key in APOSTROPHES.keys():
        if key in word:
            word = word.replace(key, APOSTROPHES[key])
    new_words.append(word)

#reassemble the string from the new words
clean_data = " ".join(new_words)

#to prove to you that we fixed the text:
print(clean_data)

You can see that this is just the beginning: fully standardizing idiomatic speech requires much more than a few search-and-replace rules. For example, this code cannot properly disambiguate the term "Henry's" in the following two sentences (see the sketch after the list):

  • "Henry's dog bit him!"
  • "Henry's a dog who bites!"

Unescaping HTML entities

HTML entities are codes used in web pages to represent common symbols that otherwise have special meaning in HTML code. For example, the code &gt; represents a > sign, and &lt; represents a < sign. The ampersand (&) character is represented as &amp;. There are many more common HTML entities you will come across if you scrape data from a website.

So you often want to replace these codes with the characters they are meant to represent.

Example

This example relies on Python's built-in html module, which can unescape HTML entities. For more sophisticated HTML parsing, see the Beautiful Soup library.

#import Python's built-in html module, which includes an unescape() function
import html

#imagine this is the data you've just scraped off of some social media web site
dirty_data = "Awwwww &amp;lt;3 luv you Chris &amp; Dawn you&amp;rsquo;re awsm peeps. BabiesRAwesome, sooo happppppy :) #sooohappppppy"

#unescape the tweet to replace HTML entities with the characters they represent
clean_data = html.unescape(dirty_data)

#to prove to you that we fixed the text:
print(clean_data)
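
Note that the sample tweet above appears to have been escaped twice (for example, &amp;lt;3), so a single pass of unescaping still leaves entities like &lt;3 behind. A minimal sketch of one way to handle that, assuming the text never legitimately contains entity-like sequences, is to keep unescaping until the string stops changing:

#keep unescaping until the string stops changing, in case the data was escaped more than once
previous = None
while clean_data != previous:
    previous = clean_data
    clean_data = html.unescape(clean_data)

print(clean_data)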

Cleaning up text formatting

This is a more general task. "Cleaning up" means different things for different data sets; it all depends on how your text is currently structured and how you'd like it to end up.

The following examples work with a data file of historic surface temperature data available from NASA's Goddard Institute for Space Studies.

Data file problems

The data file linked above has some classic problems that will need to be addressed when munging it:

  • several lines of text at the top of the file that do not contain data
  • inconsistent number of spaces between columns of data
  • there are a few missing data points, indicated by asterisks in the file (***)
  • repetition of column heading names every 20 lines or so
  • blank lines in the middle of the data set
  • several lines of text at the end of the file that do not contain data

The following examples address these problems in two different ways:

  • mostly custom code dealing with these problems
  • using pandas, a module that is built to do a lot of automated munging

Example using mostly custom code

This code parses the data file "the hard way," using only minimal assistance from ready-made modules: it relies on Python's built-in csv and re (regular expressions) modules and nothing more.

# this program is designed specifically to scrub data found on NASA's site about average temperatures across the globe
# the full set of data files is available here: http://data.giss.nasa.gov/gistemp/#tabledata
# the exact data file used in this example is located here: http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt

# many of the same concepts can be applied to any text file that needs to be scrubbed.

# import regular expressions and csv libraries
# the regular expressions library allows us to easily find patterns in the text
# the csv library allows us to easily create a csv file
import re
import csv

#open a text file in read mode
data_file = open("GLB.Ts+dSST.txt", "r") 

#create a list that will hold all rows in this file
all_rows = []

found_headings = False #flag that indicates whether we've found the column headings yet

#loop through each line in the file, one by one
for row in data_file:

    #print(row) #uncomment to see each raw line as it is read

    #make sure this is a line we're interested in... we only want one row of headings, and then every row of data

    #check whether this line contains the column headings (which start with the word, "Year")
    if not found_headings:
        if row.find("Year") == 0:
            #we found the headings!
            found_headings = True
        else:
            #otherwise, skip all lines until we find the headings
            continue
    else:
        #once we've found the headings already, only analyze rows that have numeric data
        #all rows with data start with a year, which is a number...

        #determine whether the first bit of data in the line is a number
        first_space_index = row.find(" ")
        first_word = row[0:first_space_index] #get the beginning of the line up to the first space

        #if this line doesn't start with a number, then skip it
        if not first_word.isnumeric():
            
            #move on to the next line
            continue
                

    #strip surrounding whitespace, then split the line on a separator of one-or-more whitespace characters
    #note that this is the re module's split method, not the usual string split method
    row_as_list = re.split(r"\s+", row.strip())

    #append this line of data to the data we're keeping for the entire file
    all_rows.append(row_as_list)

#now you have all data in a big two-dimensional list
print(all_rows)

#close the file
data_file.close()


#you know how to make that into a CSV file now, right?


#CONVERT TO CSV

#open a new file to create a CSV (comma-separated values) text file.  CSVs can be easily imported into Excel, relational databases, or other python programs.
csvfile = open("output1.csv", "w", newline="") #newline="" keeps the csv module from writing extra blank lines between rows on some platforms

#make a new csv "writer" object
writer = csv.writer(csvfile)

#write the two-dimensional list to the file as a CSV using the csv library
writer.writerows(all_rows)

#close the output file so all data is flushed to disk
csvfile.close()

#done!  ready to open this csv file in Excel or a relational database and do something interesting
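
The custom code above still leaves the missing data points (cells containing only asterisks, such as ***) in the output. A minimal sketch of one way to handle them, assuming you want missing values written as empty cells, is to blank out any field made up entirely of asterisks before writing a second CSV (the file name output1_cleaned.csv here is just an illustration):

#replace any field made up only of asterisks (the file's missing-data marker) with an empty string
cleaned_rows = []
for row_as_list in all_rows:
    cleaned_row = []
    for field in row_as_list:
        if field != "" and set(field) == {"*"}:
            cleaned_row.append("") #treat this as a missing value
        else:
            cleaned_row.append(field)
    cleaned_rows.append(cleaned_row)

#write the cleaned rows to a separate CSV file
with open("output1_cleaned.csv", "w", newline="") as cleaned_csvfile:
    cleaned_writer = csv.writer(cleaned_csvfile)
    cleaned_writer.writerows(cleaned_rows)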

Example using pandas

The following program performs the same task as the previous example, but uses the pandas data analysis library to drastically reduce the code.

# this program is designed specifically to scrub data found on NASA's site about average temperatures across the globe
# the full set of data files is available here: http://data.giss.nasa.gov/gistemp/#tabledata
# the exact data file used in this example is located here: http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt

# many of the same concepts can be applied to any text file that needs to be scrubbed.

#import the pandas library
import pandas as pd

#every row with data in the NASA text file has very consistent "fixed-width" columns of data
#this list contains the beginning and ending string index numbers of every column
colspecs = [(0,4), (7,10), (12,15), (17,20), (22,25), (27,30), (32,35), (37,40), (42,45), (47,50), (52,55), (57,60), (62,65), (69,73), (73,76), (80,83), (85,88), (90,93), (95,98), (100,104)]

#create a pandas DataFrame from this file, relying on the fixed-width columns listed above
#the file marks missing data points with asterisks (***), so treat those fields as missing values
df = pd.read_fwf("GLB.Ts+dSST.txt", colspecs=colspecs, skiprows=7, skipfooter=7, skip_blank_lines=True, na_values="***")

#the column headings are repeated throughout the file, so keep only rows whose Year value is actually a number
df = df[pd.to_numeric(df["Year"], errors="coerce").notnull()]

#now write the data to a nice CSV file
df.to_csv("output2.csv", sep=",", index=False)

#done!  ready to open this csv file in Excel or a relational database and do something interesting
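
As a quick sanity check, you may also want to peek at the DataFrame before writing the CSV, for example:

#print the first few rows and the column names to verify the data was parsed as expected
print(df.head())
print(df.columns)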

Parsing HTML

The module BeautifulSoup is very useful for parsing HTML code.

#import the module
from bs4 import BeautifulSoup

#let's say you have an HTML file you scraped off the web...
html_doc = """
<!doctype html>
<html>
    <head>
        <title>Puppies For Sale</title>
    </head>
    <body>
        <div id="wrapper">
            <h1>Buy Your Puppy Today!</h1>
            <article>
                <h2>Earl</h2>
                <p>Italian Greyhound & Border Terrier Mix.  $5.99</p>
                <button>Buy now!</button>
            </article>
            <article>
                <h2>Jazmine</h2>
                <p>Pit Bull Terrier.  $6.49</p>
                <button>Buy now!</button>
            </article>
            <article>
                <h2>Tiny Tim</h2>
                <p>Australian Shepherd & Spaniel Mix.  $2.29</p>
                <button>Buy now!</button>
            </article>
        </div>
    </body>
</html>
"""

#...and you want to make some sense of it
soup = BeautifulSoup(html_doc, "lxml") #the "lxml" parser requires the lxml package; Python's built-in "html.parser" also works here

# print out the title of the document, for example
print("Searching", soup.title.string, "...")

#loop through all the <article> tags in the HTML code
for article in soup.find_all('article'):

    #get the name and description of each puppy
    puppy_name = article.h2.contents[0]
    puppy_desc = article.p.contents[0]

    #print out the puppy details
    print() #line break
    print(puppy_name)
    print(puppy_desc)

#etc...  check out more: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
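
As a further sketch, you could combine Beautiful Soup with the re module used earlier to pull the price out of each puppy's description. The pattern below assumes prices are always written as dollars and cents, as in the HTML above:

import re

#loop through the articles again and pull the dollar amount out of each description
for article in soup.find_all('article'):
    puppy_name = article.h2.contents[0]
    puppy_desc = article.p.contents[0]

    #look for a price written like $5.99
    match = re.search(r"\$(\d+\.\d{2})", puppy_desc)
    if match:
        print(puppy_name, "costs $" + match.group(1))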

Creating a web crawling spider

Downloading content off the web is easy with the urllib.request library.
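
A minimal sketch of fetching a single page (using example.com as a stand-in URL) looks like this:

import urllib.request

#request a web page and read its HTML code, decoding the raw bytes into a string
response = urllib.request.urlopen("http://example.com")
html_code = response.read().decode("utf-8", errors="ignore")
print(html_code)

The larger example below applies the same idea to many pages at once.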

#example that crawls all i6 accounts and counts how many times a given word is mentioned
#note this might take a while, as there are a lot of i6 accounts!

import urllib.request
import urllib.error

#get a list of all letters, and another list of all numbers
letters = "a b c d e f g h i j k l m n o p q r s t u v w x y z".split(" ")
numbers = "0 1 2 3 4 5 6 7 8 9".split(" ")

#use these to generate all possible NYU Net IDs.  All NYU Net IDs are 2 letters followed by 4 numbers
net_ids = [] #a blank list that will hold all possible net ids

#nested for loops!
for l1 in letters:
    for l2 in letters:
        for n1 in numbers:
            for n2 in numbers:
                for n3 in numbers:
                    for n4 in numbers:
                        net_id = l1 + l2 + n1 + n2 + n3 + n4
                        net_ids.append(net_id)

#debugging
print(net_ids)

#count how many times the word "awesome" is used on i6 user home pages
page_counter = 0
awesome_counter = 0

#loop through all net ids and scrape their i6 pages
for net_id in net_ids:

    #make an http request for the web page of this person
    #not every possible net id has a web page, so skip any that fail to load
    try:
        f = urllib.request.urlopen("http://i6.cims.nyu.edu/~" + net_id)
    except urllib.error.URLError:
        #no page for this net id... move on to the next one
        continue

    #get the html code for this page, decoded from bytes into a string
    html_code = f.read().decode("utf-8", errors="ignore")

    #increment our page counter
    page_counter = page_counter + 1

    #you can now do any kind of analysis on these pages here
    if "awesome" in html_code:
        #increment the awesome counter
        awesome_counter = awesome_counter + 1

        #debugging
        print("Found an awesome on " + net_id + "'s page.")

