Using pandas for data analysis
- 1 Module imports
- 2 Data import and analysis
- 3 Concepts
- 4 Viewing data
- 5 Statistics
- 6 Transposing
- 7 Sorting
- 8 Example of importing an existing data set
- 9 Series
- 10 Counting occurrences of values
- 11 Sorting values
- 12 Filtering
- 13 String operations
- 14 References
Before working with pandas, it is typical to import the following three modules at the top of every pandas-enabled program:
- pandas - the data analysis module
- numpy - the foundational package of the scientific Python ecosystem, for dealing with numbers more efficiently than native Python data types
- matplotlib.pyplot - a commonly-used module for making visualizations of data
These imports are usually aliased to make referring to them syntactically shorter:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Data import and analysis
Pandas is a powerful and easy-to-use data analysis library for Python. Pandas can natively read a variety of common data formats into a data type called a DataFrame, which facilitates the analysis of that data.
In particular, pandas supports the following data formats for the source data:
- CSV files - using the read_csv() method
- JSON files - using the read_json() method
- Excel files - using the read_excel() method
- Fixed-width formatted text files - using the read_fwf() method
- SQL Database Tables - using the read_sql() method
- HTML documents - using the read_html() method
Some examples, assuming pandas has already been imported as pd:
# import data from a CSV file into a DataFrame
df = pd.read_csv('data.csv', skiprows=1)  # set skiprows=None if you don't want to skip any rows in the file
# import data from SQL query results into a DataFrame,
# assuming a connection to a database has already been established as cxn
df = pd.read_sql('SELECT * FROM viking_metal_bands;', con=cxn)
Using pandas requires that you understand the basics of two pandas-specific data types:
- Series - an array or list of values
- DataFrame - a collection of data in row/column form
A series is a one-dimensional array of data. Any list or array can be converted into a pandas-specific Series. For example, the following code creates a Series from a list of values:
my_list = [1, 3, 5, np.nan, 6, 8]
pd.Series(my_list)
Note the use of numpy's nan ("not a number") value to represent a null value. While in this case the data is hard-coded, np.nan is commonly used to represent missing data in an existing data set, a frequently encountered situation. By default, pandas excludes nan values from aggregate calculations, such as averages, minimums, and maximums, which solves a big problem for data analysis.
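A minimal sketch of this behavior, using the same Series as above:

```python
import numpy as np
import pandas as pd

# a Series containing one missing value
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# aggregate methods skip nan by default:
# the mean is computed over the 5 non-null values, (1 + 3 + 5 + 6 + 8) / 5
print(s.mean())   # 4.6
print(s.count())  # 5 -- count() only counts non-null values
```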
A DataFrame is a pandas-specific data structure designed to be conceptually similar to a spreadsheet in that it represents data as a two-dimensional grid consisting of rows and columns.
- The DataFrame provides a lot of methods to filter and analyze the data in the grid.
- Every DataFrame has a field that serves as an index, similar to a primary key in a relational database table
- Each column in the DataFrame is a Series
- Each row in the DataFrame is also a Series
- In other words, a DataFrame can be considered a sequence of Series that all share the same index
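A quick sketch illustrating the last two points, using a throwaway DataFrame with random data:

```python
import numpy as np
import pandas as pd

# a small DataFrame with dates as the index and four labeled columns
dates = pd.date_range('20180101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# a column is a Series...
print(type(df['A']))              # <class 'pandas.core.series.Series'>

# ...and so is a row, retrieved here by its index label
print(type(df.loc['2018-01-01'])) # <class 'pandas.core.series.Series'>
```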
Creating a DataFrame from scratch
While it is common to import a data set from a CSV or SQL table into a DataFrame, it is also possible to create a DataFrame from scratch.
The following example code creates a DataFrame that looks like the table below, with a set of dates as the indexes, columns labeled as A, B, C, and D, and data consisting of 6 rows and 4 columns full of nonsense numbers.
                   A         B         C         D
2018-01-01 -1.969724  1.794276  1.534053  0.650349
2018-01-02 -0.186920  0.245413  0.499354  0.212531
2018-01-03 -0.085285  0.009889 -0.063703 -0.900914
2018-01-04 -1.521503  1.024530  0.633024  1.078262
2018-01-05 -0.956055 -1.310655  0.576833 -0.775409
2018-01-06 -0.955924 -2.775541 -0.845290  1.573933
The code to create this DataFrame with dummy data:
# automatically create a range of 6 different dates
dates = pd.date_range('20180101', periods=6)
print("Dummy dates:\n" + str(dates))  # see what it looks like

# create some dummy column names
column_names = list('ABCD')  # column names will be ["A", "B", "C", "D"]
print("Dummy column names:\n" + str(column_names))  # see what it looks like

# automatically create 6 rows and 4 columns of dummy data
data = np.random.randn(6, 4)
print("Dummy data:\n" + str(data))  # see what it looks like

# create a DataFrame with these dates as the indices, the letters A, B, C, and D
# as column names, and some random numbers as the data in each cell
df = pd.DataFrame(data, index=dates, columns=column_names)
print("The DataFrame:\n" + str(df))  # see what it looks like
Note the use of pandas' date_range() to create a range of dates, and the use of numpy's random number generator to produce the dummy data.
head() and tail()
The head() and tail() methods of a DataFrame return the top or bottom batch of rows, respectively (5 rows by default).
# returns the top 30 rows
df.head(30)

# returns the bottom 30 rows
df.tail(30)
index, columns, and values
# refers to the index used by the DataFrame
df.index

# refers to the columns in the DataFrame
df.columns

# refers to the data in the DataFrame cells
df.values
# shows a quick statistical summary of the data in each column of the
# DataFrame, assuming it is numeric data
# includes min, max, mean, count, and standard deviation
df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804
# the shape property contains a tuple with how many rows and columns are in the data set
df.shape

# the info() method prints some metadata, including the data type and
# non-null count of each of the columns in the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2018-01-01 to 2018-01-06
Freq: D
Data columns (total 4 columns):
A    6 non-null float64
B    6 non-null float64
C    6 non-null float64
D    6 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes
Transposing data means flipping the axes such that rows become columns and columns become rows.
# generate a transposition of the data
df.T
   2018-01-01  2018-01-02  2018-01-03  2018-01-04  2018-01-05  2018-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
Sorting by values in a column:
# sort by the values in the B column, in ascending order by default
df.sort_values(by='B')
                   A         B         C         D
2018-01-03 -0.861849 -2.104569 -0.494929  1.071804
2018-01-04  0.721555 -0.706771 -1.039575  0.271860
2018-01-01  0.469112 -0.282863 -1.509059 -1.135632
2018-01-02  1.212112 -0.173215  0.119209 -1.044236
2018-01-06 -0.673690  0.113648 -1.478427  0.524988
2018-01-05 -0.424972  0.567020  0.276232 -1.087401
Example of importing an existing data set
import pandas as pd
import numpy as np

# read a CSV file into a DataFrame, ignoring the first row in the file
df = pd.read_csv('data.csv', skiprows=1)  # set skiprows=None if you don't want to skip any rows in the file

# alternatively, you can read data from a SQL query into a DataFrame... e.g.
#
# import pymysql
#
# # connect to database
# cxn = pymysql.connect(
#     host="your_host_name",
#     user="your_user_name",
#     passwd="your_password",
#     db="your_db_name",
#     charset='utf8mb4',
#     cursorclass=pymysql.cursors.DictCursor)
#
# # convert the results of any SQL read query into a pandas DataFrame
# df = pd.read_sql('SELECT jobs.job_id, jobs.title, agencies.title AS agency, jobs.salary_range_from, jobs.salary_range_to FROM jobs, agencies WHERE jobs.agency_id=agencies.agency_id', con=cxn)

# return the first ten rows in the DataFrame
df.head(10)

# return the last ten rows in the DataFrame
df.tail(10)

# return the first 30 rows and last 30 rows within the DataFrame
df

# show how many rows and columns are in the DataFrame (as a tuple, which is like an immutable list)
df.shape

# get the number of rows and columns from the shape into separate variables
num_rows = df.shape[0]
num_cols = df.shape[1]

# get some metadata about what's inside the DataFrame
# useful for finding if there is any missing (null) data and for understanding
# the data types stored in each field of the DataFrame
df.info()
A pandas Series is a one-dimensional array of indexed data.
- Every value in a Series has an integer index that corresponds with it, like a List in Python
- Pandas includes a host of methods for performing operations involving the index.
- Each column in a DataFrame is stored as a Series.
- Each row in a DataFrame is also itself a Series!
Accessing Series in a DataFrame
Series in a DataFrame can be accessed either with dot notation, or with square bracket notation.
Imagine you had row/column-like data in a DataFrame stored in a variable 'df', with one column labeled 'City':
df["City"] #this works for all Series names
evaluates the same Series as
df.City #dot notation only works for Series with no spaces in their name
It is also possible to access multiple Series in a DataFrame - put a list between the initial square brackets:
df[["City", "State"]] #this evaluates to a new DataFrame with just these two Series
Counting occurrences of values
The value_counts() method returns a Series showing how many times each distinct value occurs within a given Series, sorted in descending order of frequency.
For example, the following function call returns how many times each city appears within the City series, in descending order:
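A sketch of the call, using made-up City data in place of the df used elsewhere on this page:

```python
import pandas as pd

# hypothetical data standing in for the df used in this article's examples
df = pd.DataFrame({'City': ['New York', 'Boston', 'New York', 'Chicago', 'New York', 'Boston']})

# count how many times each value appears, most frequent first
counts = df['City'].value_counts()
print(counts)  # New York 3, Boston 2, Chicago 1
```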
To get the same list in ascending order:
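A sketch using the ascending=True argument, again with made-up data:

```python
import pandas as pd

# hypothetical data for illustration
df = pd.DataFrame({'City': ['New York', 'Boston', 'New York', 'Chicago', 'New York', 'Boston']})

# same counts, but least frequent value first
counts = df['City'].value_counts(ascending=True)
print(counts)  # Chicago 1, Boston 2, New York 3
```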
The sort_values() method returns a new Series with the values in the original Series in a sorted order.
For example, to sort the values in the City column in ascending order:
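A sketch, again assuming a DataFrame df with a City column (the data here is made up):

```python
import pandas as pd

# hypothetical data for illustration
df = pd.DataFrame({'City': ['New York', 'Boston', 'Chicago']})

# returns a new Series sorted alphabetically; the original Series is unchanged
sorted_cities = df['City'].sort_values()
print(sorted_cities)  # Boston, Chicago, New York
```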
Sorts can also be performed on a DataFrame itself, when a sort along a few columns is desired:
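A sketch of a multi-column sort using sort_values() on the DataFrame itself (the State and City columns here are made up):

```python
import pandas as pd

# hypothetical data for illustration
df = pd.DataFrame({
    'State': ['NY', 'MA', 'NY'],
    'City': ['Buffalo', 'Boston', 'Albany'],
})

# sort rows by State first, then by City within each State
df_sorted = df.sort_values(by=['State', 'City'])
print(df_sorted)  # Boston (MA), then Albany and Buffalo (NY)
```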
It is possible to have pandas return only those rows from a DataFrame that meet certain criteria.
Finding rows that meet single criterion
Return a Series that shows the index of each row in the DataFrame and a True or False value for each indicating whether it meets the specific criterion:
df.City == "New York"
You can use this same expression to return a full DataFrame consisting only of those rows that meet the criterion:
df[df.City == "New York"]
Boolean logic operators
For multiple criteria, one must use Boolean logic. In pandas, rather than using the usual 'and', 'or', and 'not' Boolean operators, we rely on the following element-wise Boolean operators:
- & - and equivalent
- | - or equivalent
- ~ - not equivalent
Finding rows that meet multiple criteria
Return a DataFrame of all rows that meet multiple criteria (using the 'and' operator in the criteria):
df[(df.City == "Greece") & (df.Population > 2000)]
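A runnable sketch of this kind of filter, with made-up data; note that parentheses around each comparison are required, because & binds more tightly than == and >:

```python
import pandas as pd

# hypothetical data for illustration
df = pd.DataFrame({
    'City': ['Athens', 'Greece', 'Greece', 'Sparta'],
    'Population': [5000, 1500, 3000, 2500],
})

# keep only rows matching both criteria
matches = df[(df.City == "Greece") & (df.Population > 2000)]
print(matches)  # the single row with City "Greece" and Population 3000
```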
The pandas Series data structure includes a variety of methods related to strings. Most standard Python string methods have vectorized equivalents available through a Series' .str accessor.
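For example, a sketch of a few vectorized string methods available through a Series' .str accessor (the City data here is made up):

```python
import pandas as pd

# hypothetical data for illustration
df = pd.DataFrame({'City': ['New York', 'Boston', 'Chicago']})

# each .str method is applied to every string in the Series at once
print(df['City'].str.upper())          # NEW YORK, BOSTON, CHICAGO
print(df['City'].str.contains('New'))  # True, False, False
print(df['City'].str.len())            # 8, 6, 7
```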
- 10 minutes to pandas - https://pandas.pydata.org/pandas-docs/stable/10min.html
- Lynda.com / pandas Essential Training - https://www.lynda.com/Data-Science-tutorials/pandas-Essential-Training/636129-2.html