Using pandas for data analysis

Module imports

Before working with pandas, it is typical to import the following three modules at the top of every pandas-enabled program:

  • pandas - the data analysis module
  • numpy - a foundational package of the SciPy ecosystem for handling numeric data more efficiently than native Python data types
  • matplotlib.pyplot - a commonly-used module for making visualizations of data

These imports are usually aliased to make referring to them syntactically shorter:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Data import and analysis

Pandas is a powerful and easy-to-use data analysis library for Python. It can natively read a variety of common data formats into a data type called a DataFrame, which facilitates the analysis of that data.

In particular, pandas supports the following data formats for the source data:

  • CSV files - using the read_csv() method
  • JSON files - using the read_json() method
  • Excel files - using the read_excel() method
  • Fixed-width formatted text files - using the read_fwf() method
  • SQL Database Tables - using the read_sql() method
  • HTML documents - using the read_html() method

Some examples, assuming pandas has already been imported as pd:

# import data from a CSV file into a DataFrame
df = pd.read_csv('data.csv', skiprows=1) #set skiprows=None if you don't want to skip any rows in the file
# import data from SQL query results into a DataFrame, assuming a connection to a database has already been established as cxn
df = pd.read_sql('SELECT * FROM viking_metal_bands;', con=cxn)

Concepts

Using pandas requires that you understand the basics of two pandas-specific data types:

  • Series - an array or list of values
  • DataFrame - a collection of data in row/column form

Series

A Series is a one-dimensional array of data. Any Python list or array can be converted into a pandas Series. For example, the following code creates a Series from a list of values:

my_list = [1,3,5,np.nan,6,8]
pd.Series(my_list)
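Evaluates to the following Series, with an automatically-generated integer index (note that the integers are converted to floats, since np.nan is a float):

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64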

Note the use of numpy's nan ("not a number") value to represent a null value. While in this case the data is hard-coded, np.nan is commonly used to represent missing values in an existing data set, a frequently-encountered situation. Pandas ignores nan values in aggregate calculations, such as averages, minimums, and maximums, which solves a big problem for data analysis.
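For example, a quick check that aggregate calculations skip nan values:

# the mean is computed over the five non-null values only
pd.Series(my_list).mean() # evaluates to 4.6, the average of 1, 3, 5, 6, and 8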

DataFrames

A DataFrame is a pandas-specific data structure designed to be conceptually similar to a spreadsheet in that it represents data as a two-dimensional grid consisting of rows and columns.

  • The DataFrame provides a lot of methods to filter and analyze the data in the grid.
  • Every DataFrame has a field that serves as an index, similar to a primary key in a relational database table
  • Each column in the DataFrame is a Series
  • Each row in the DataFrame is also a Series
  • In other words, a DataFrame can be considered a sequence of Series that all share the same index

Creating a DataFrame from scratch

While it is common to import a data set from a CSV or SQL table into a DataFrame, it is also possible to create a DataFrame from scratch.

The following example code creates a DataFrame that looks like the table below, with a set of dates as the indexes, columns labeled as A, B, C, and D, and data consisting of 6 rows and 4 columns full of nonsense numbers.

                   A         B         C         D
2018-01-01 -1.969724  1.794276  1.534053  0.650349
2018-01-02 -0.186920  0.245413  0.499354  0.212531
2018-01-03 -0.085285  0.009889 -0.063703 -0.900914
2018-01-04 -1.521503  1.024530  0.633024  1.078262
2018-01-05 -0.956055 -1.310655  0.576833 -0.775409
2018-01-06 -0.955924 -2.775541 -0.845290  1.573933

The code to create this DataFrame with dummy data:

# automatically create a range of 6 different dates
dates = pd.date_range('20180101', periods=6)
print("Dummy dates:\n" + str(dates)) # see what it looks like

# create some dummy column names
column_names = list('ABCD') # column names will be ["A", "B", "C", "D"]
print("Dummy column names:\n" + str(column_names)) # see what it looks like

# automatically create 6 rows and 4 columns of dummy data
data = np.random.randn(6,4)
print("Dummy data:\n" + str(data)) # see what it looks like

# create a DataFrame with these dates as the indices, the letters A, B, C, and D as column names, and some random numbers as the data in each cell
df = pd.DataFrame(data, index=dates, columns=column_names)
print("The DataFrame:\n" + str(df)) #see what it looks like

Note the use of pandas to create the range of dates, and of numpy to generate the random dummy data.
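As mentioned earlier, each column and each row of this DataFrame is itself a Series, which can be verified directly:

# each column is a Series
type(df['A']) # <class 'pandas.core.series.Series'>
# each row, accessed by its index label, is also a Series
type(df.loc[dates[0]]) # <class 'pandas.core.series.Series'>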

Viewing data

head() and tail()

The head() and tail() methods of a DataFrame show the top or bottom batch of rows, respectively.

# returns the top 30 rows
df.head(30)
# returns the bottom 30 rows
df.tail(30)
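Called with no argument, both methods default to 5 rows:

# returns the top 5 rows
df.head()
# returns the bottom 5 rows
df.tail()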

index, columns, and values

# refers to the index used by the DataFrame
df.index
# refers to the columns in the DataFrame
df.columns
# refers to the data in the DataFrame cells
df.values
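For the dummy DataFrame created above, df.index evaluates to a DatetimeIndex, df.columns to an Index of the column labels, and df.values to a two-dimensional numpy array of the cell data:

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')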

Statistics

describe()

# shows a quick statistical summary of the data in each column of the DataFrame, assuming it is numeric data
# includes count, mean, standard deviation, min, max, and the quartiles
df.describe()

Returns output like the following (the exact values will differ, since the data is randomly generated):

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804

shape

# the shape property is a tuple of (number of rows, number of columns) in the data set
df.shape

Evaluates to:

(6, 4)

info()

# the info() method prints metadata about the DataFrame, including the data type and non-null count of each column
df.info()

Returns:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2018-01-01 to 2018-01-06
Freq: D
Data columns (total 4 columns):
A    6 non-null float64
B    6 non-null float64
C    6 non-null float64
D    6 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes 

Transposing

Transposing data means flipping the axes such that rows become columns and columns become rows.

# generate a transposition of the data
df.T

Evaluates to:

   2018-01-01  2018-01-02  2018-01-03  2018-01-04  2018-01-05  2018-01-06
A   -1.969724   -0.186920   -0.085285   -1.521503   -0.956055   -0.955924
B    1.794276    0.245413    0.009889    1.024530   -1.310655   -2.775541
C    1.534053    0.499354   -0.063703    0.633024    0.576833   -0.845290
D    0.650349    0.212531   -0.900914    1.078262   -0.775409    1.573933

Sorting

Sorting by values in a column:

# sort by the values in the B column, in ascending order by default
df.sort_values(by='B')

Evaluates to:

                   A         B         C         D
2018-01-06 -0.955924 -2.775541 -0.845290  1.573933
2018-01-05 -0.956055 -1.310655  0.576833 -0.775409
2018-01-03 -0.085285  0.009889 -0.063703 -0.900914
2018-01-02 -0.186920  0.245413  0.499354  0.212531
2018-01-04 -1.521503  1.024530  0.633024  1.078262
2018-01-01 -1.969724  1.794276  1.534053  0.650349
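To sort in descending order instead, pass ascending=False:

# sort by the values in the B column, largest values first
df.sort_values(by='B', ascending=False)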

Example of importing an existing data set

import pandas as pd
import numpy as np

#read a CSV file into a DataFrame structure, skipping the first row in the file
df = pd.read_csv('data.csv', skiprows=1) #set skiprows=None if you don't want to skip any rows in the file

#alternatively, you can read data from a SQL query into a DataFrame... e.g. 
#
#import pymysql
#
#connect to database
#cxn = pymysql.connect(
#    host="your_host_name", 
#    user="your_user_name", 
#    passwd="your_password", 
#    db="your_db_name", 
#    charset='utf8mb4',
#    cursorclass=pymysql.cursors.DictCursor) 
#
#convert the results of any SQL read query into a pandas DataFrame
#df = pd.read_sql('SELECT jobs.job_id, jobs.title, agencies.title AS agency, jobs.salary_range_from, jobs.salary_range_to FROM jobs, agencies WHERE jobs.agency_id=agencies.agency_id', con=cxn)

#return the first ten rows in the DataFrame
df.head(10) 

#return the last ten rows in the DataFrame
df.tail(10) 

#display the DataFrame; for large data sets, pandas prints a truncated view showing only the first and last rows
df

#show how many rows and columns are in the DataFrame (as a tuple, which is like an immutable List)
df.shape

#get the number of rows and columns from the shape into separate variables
num_rows = df.shape[0]
num_cols = df.shape[1]

#get some metadata of what's inside the DataFrame
#useful for finding if there is any missing (null) data in the data and to understand the data types stored in each field of the DataFrame
df.info()
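A related way to count the missing values in each column directly:

#count the number of null values in each column of the DataFrame
df.isnull().sum()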

Series

A pandas Series is a one-dimensional array of indexed data.

  • Every value in a Series has an index that corresponds with it (by default an integer position, like a List in Python)
  • Pandas includes a host of methods for performing operations involving the index.
  • Each column in a DataFrame is stored as a Series.
  • Each row in a DataFrame is also itself a Series!

Accessing Series in a DataFrame

Series in a DataFrame can be accessed either with dot notation, or with square bracket notation.

Imagine you had row/column-like data in a DataFrame stored in a variable 'df', with one column labeled 'City':

df["City"] #this works for all Series names

evaluates the same Series as

df.City #dot notation only works for Series whose names contain no spaces and don't clash with existing DataFrame attributes

It is also possible to access multiple Series in a DataFrame by placing a list of column names inside the square brackets:

df[["City", "State"]] #this evaluates to a new DataFrame with just these two Series

Counting occurrences of values

The value_counts() method returns a Series showing how many times each distinct value occurs within a given Series, in descending order of frequency by default.

For example, the following call returns how many times each city appears within the City Series, in descending order:

df["City"].value_counts()

To get the same list in ascending order:

df["City"].value_counts(ascending=True)

Sorting values

The sort_values() method returns a new Series with the values in the original Series in a sorted order.

For example, to sort the values in the City column in ascending order:

df["City"].sort_values()

Sorts can also be performed on a DataFrame itself, when sorting by multiple columns is desired:

df.sort_values(by=["City", "State"])

Filtering

It is possible to have pandas return only the rows from a DataFrame that meet certain criteria.

Finding rows that meet a single criterion

Return a Series that shows the index of each row in the DataFrame and a True or False value for each indicating whether it meets the specific criterion:

df.City == "New York"

You can use this same expression to return a full DataFrame consisting only of those rows that meet the criterion:

df[df.City == "New York"]

Boolean logic operators

For multiple criteria, one must use Boolean logic. In pandas, rather than the usual 'and', 'or', and 'not' Boolean operators, the following element-wise operators are used:

  • & - and equivalent
  • | - or equivalent
  • ~ - not equivalent

Finding rows that meet multiple criteria

Return a DataFrame of all rows that meet multiple criteria (combining them with the & operator). Note that each comparison must be wrapped in parentheses, since & binds more tightly than the comparison operators:

df[(df.City == "Greece") & (df.Population > 2000)]
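The | and ~ operators work the same way; for example, a sketch using the same hypothetical City, State, and Population columns:

# rows where the population is over 2000 or the state is not "NY"
df[(df.Population > 2000) | ~(df.State == "NY")]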

String operations

The pandas Series data structure includes a variety of string-related methods, available through the Series' .str attribute. Most standard Python string methods have an equivalent there.

df["City"].str.contains("Croton")
df["State"].str.startswith("N")
df["City"].str.isnumeric("Population")

