knowledge-kitchen

Pandas - Data Munging and Analysis In A Cuddly Form

  1. Overview
  2. numpy
  3. Series
  4. DataFrame
  5. Visualizations
  6. Conclusions

Overview

Concept

pandas is an extraordinarily powerful data munging and analysis module, and is a key part of the Python-based scientific computing toolkit.

Features

Some of pandas’ most valuable features:

Numpy

Concept

pandas is built on top of numpy - a Python module for optimized handling of n-dimensional arrays of data. A basic understanding of numpy is thus helpful in understanding pandas.

Example Notebook

A basic understanding of numpy will help master pandas.

Installing

numpy and pandas, like most popular modules, can be installed with the conda or pip package managers, preferably into a virtual environment.

pip install numpy # try 'pip3' instead of 'pip' if your system requires it
pip install pandas

Importing into a Python script with the np and pd aliases is the convention:

import numpy as np
import pandas as pd

Creating

ndarrays can be created from scratch with a variety of numpy functions.
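As a minimal sketch of a few of those creation functions (the variable names here are illustrative only, not taken from the original notebook):

```python
import numpy as np

# Build an ndarray from an existing Python list
a = np.array([1, 2, 3])

# Arrays filled with a constant; the argument is the desired shape
zeros = np.zeros((2, 3))   # 2 rows x 3 columns of 0.0
ones = np.ones(4)          # four 1.0 values

# Evenly spaced values
evens = np.arange(0, 10, 2)     # start, stop (exclusive), step: 0, 2, 4, 6, 8
points = np.linspace(0, 1, 5)   # 5 values from 0 to 1 inclusive: 0, 0.25, 0.5, 0.75, 1

print(zeros.shape)  # (2, 3)
print(evens)        # [0 2 4 6 8]
```

np.arange() mirrors Python's range(), while np.linspace() takes the number of points rather than a step size.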

Indexing and slicing

ndarrays in numpy can be indexed and sliced in the same manner as Python lists.

Take the following ndarray:

x = np.array([10, 12, 14, 16, 18])
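A short sketch of list-style indexing and slicing on that array (variable names are illustrative):

```python
import numpy as np

x = np.array([10, 12, 14, 16, 18])

first = x[0]          # 10
last = x[-1]          # 18, negative indices count from the end
middle = x[1:4]       # positions 1, 2 and 3: 12, 14, 16
every_other = x[::2]  # every second value: 10, 14, 18

# Unlike a list slice, an ndarray slice is a view onto the same data.
# Call .copy() when an independent array is needed.
independent = x[1:4].copy()
independent[0] = 0    # x itself is unchanged
```

One difference from lists worth keeping in mind: modifying a slice without .copy() would also modify the original array.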

Simple math operations

It is straightforward to perform the same math operation across all values in an ndarray.
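For instance, arithmetic operators apply element by element, with no explicit loop. A small sketch with made-up values:

```python
import numpy as np

x = np.array([10, 12, 14, 16, 18])

doubled = x * 2    # 20, 24, 28, 32, 36
shifted = x + 1    # 11, 13, 15, 17, 19
halved = x / 2     # true division gives floats: 5.0, 6.0, 7.0, 8.0, 9.0
squared = x ** 2   # 100, 144, 196, 256, 324

# Two arrays of the same shape are combined element by element
summed = x + doubled   # 30, 36, 42, 48, 54
```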

Basic statistics

numpy includes functions to perform basic statistics on any ndarray, such as calculating the min, max, mean, median, and standard deviation.

For example, take the following ndarray:

x = np.array([
    [ 2, 50, 100],
    [ 3, 60, 200],
    [ 4, 55, 150],
    [ 5, 40, 250]
])
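A sketch of how those statistics might be computed on this array. The axis argument chooses between whole-array, per-column (axis=0) and per-row (axis=1) results:

```python
import numpy as np

x = np.array([
    [2, 50, 100],
    [3, 60, 200],
    [4, 55, 150],
    [5, 40, 250],
])

# Over the whole array
smallest = x.min()   # 2
largest = x.max()    # 250

# Per column (axis=0 collapses the rows)
col_means = x.mean(axis=0)            # 3.5, 51.25, 175.0
col_medians = np.median(x, axis=0)    # 3.5, 52.5, 175.0
col_stds = x.std(axis=0)              # population standard deviation of each column

# Per row (axis=1 collapses the columns)
row_maxes = x.max(axis=1)             # 100, 200, 150, 250
```

Note that std() computes the population standard deviation by default; passing ddof=1 gives the sample standard deviation instead.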

Filtering

You may apply certain conditions to extract a subset of values from an ndarray.

Take the following ndarray:

a = np.array([10, 12, 14, 16, 18])
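A minimal sketch of filtering with a boolean mask; the thresholds are arbitrary values chosen for illustration:

```python
import numpy as np

a = np.array([10, 12, 14, 16, 18])

# A comparison yields an array of booleans, one per element
mask = a > 12              # False, False, True, True, True

# Indexing with the mask keeps only the elements where it is True
big = a[mask]              # 14, 16, 18

# Combine conditions with & (and) or | (or); the parentheses are required
between = a[(a > 10) & (a < 18)]   # 12, 14, 16

# np.where() returns the positions that satisfy the condition
positions = np.where(a > 12)[0]    # 2, 3, 4
```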

Removing null values

The value np.nan represents a null value, and the function np.isnan() can be helpful in finding null values in an array.

For example, take the following data:

x = np.array([np.nan, 1, 12, np.nan, 3, 41])
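One way to drop the null values, sketched here, is to build a mask with np.isnan() and invert it:

```python
import numpy as np

x = np.array([np.nan, 1, 12, np.nan, 3, 41])

# True wherever the value is null
is_null = np.isnan(x)

# ~ inverts the mask, so only the non-null values are kept
clean = x[~is_null]        # 1.0, 12.0, 3.0, 41.0

# Alternatively, the nan-aware functions skip nulls without removing them
mean_ignoring_nulls = np.nanmean(x)   # 14.25
```

A plain x.mean() on the original array would return nan, which is why either the mask or a nan-aware function is needed.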

Series

Concept

A Series in pandas is a one-dimensional, labeled array of values, often representing the values in either a single row or a single column of a tabular data structure.
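As a rough sketch, with made-up temperature readings and index labels that are not from the original notebook:

```python
import pandas as pd

# Hypothetical data: one value per labeled day
s = pd.Series([72, 65, 80], index=["mon", "tue", "wed"], name="temperature")

by_label = s["tue"]         # 65, looked up by index label
by_position = s.iloc[0]     # 72, looked up by integer position
warm_days = s[s > 70]       # boolean filtering works as in numpy: mon and wed
average = s.mean()          # mean of the three values

# The underlying values are available as a numpy ndarray
values = s.to_numpy()
```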

Examples

See this example Jupyter Notebook for examples exhibiting some of the core Series concepts.

DataFrame

Concept

A DataFrame, a two-dimensional table with labeled rows and columns, is the main data type that users of pandas interact with.
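A small sketch of common DataFrame operations. The column names and records below are invented for illustration:

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "age": [31, 25, 42],
    "city": ["Lima", "Oslo", "Lima"],
})

ages = df["age"]                 # a single column is a Series
first_row = df.iloc[0]           # a single row is also a Series
over_30 = df[df["age"] > 30]     # rows where the condition holds: Ana and Cy

# Group rows by city and average the ages within each group
mean_age_by_city = df.groupby("city")["age"].mean()   # Lima 36.5, Oslo 25.0
```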

Examples

See this example Jupyter Notebook for examples exhibiting some of the core DataFrame concepts.

Visualizations

Concept

pandas contains wrappers around the popular matplotlib plotting module, and includes functions for creating several common types of plots.
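A minimal sketch of the plotting wrappers, assuming matplotlib is installed alongside pandas. The sales figures are invented, and the "Agg" backend is chosen here only so the script runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; not needed in a notebook

import pandas as pd

# Hypothetical data
df = pd.DataFrame({"year": [2020, 2021, 2022], "sales": [100, 150, 125]})

# DataFrame.plot() returns a matplotlib Axes that can be customized further
line_ax = df.plot(x="year", y="sales", kind="line", title="Sales by year")
line_ax.set_ylabel("sales")

# Each plot type is also available as a method, such as plot.bar()
bar_ax = df.plot.bar(x="year", y="sales")
```

In a Jupyter Notebook the charts render inline; in a script, calling matplotlib.pyplot.show() with an interactive backend would display them.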

Examples

See this example Jupyter Notebook for examples of data visualizations using pandas and matplotlib.

matplotlib examples

While it is not necessary to have a deep understanding of matplotlib in order to use pandas plotting functions, it might be helpful. Here is a sample Jupyter Notebook with some simple matplotlib examples that don’t use pandas.
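For comparison, a plain matplotlib line chart without pandas might look like the sketch below. The data is invented, and the "Agg" backend is again an assumption made so the code runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; not needed in a notebook

import matplotlib.pyplot as plt

# Hypothetical data as plain Python lists
years = [2020, 2021, 2022]
sales = [100, 150, 125]

# Create a figure with a single set of axes, then draw and label the line
fig, ax = plt.subplots()
ax.plot(years, sales, marker="o")
ax.set_xlabel("year")
ax.set_ylabel("sales")
ax.set_title("Sales by year")
```

The Axes object used here is the same kind of object that the pandas plotting functions return, so the labeling calls carry over directly.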

Conclusions

Thank you. Bye.