Summary
Concepts Covered
- Understand the basics of pandas and its fundamental operations.
- The pandas library: DataFrames, reading CSV files, and basic operations on file data.
The pandas package is the most important tool at the disposal of Analysts working in Python today. The powerful machine learning and glamorous visualization tools may get all the attention, but pandas is the backbone of most data projects.
[pandas] is derived from the term "panel data", a term for data sets that include observations over multiple time periods for the same individuals. — [Wikipedia]
What’s Pandas for?
Pandas has so many uses that it might make sense to list the things it can’t do instead of what it can do.
This tool is essentially your data’s home. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.
CSV: CSV stands for "comma-separated values", so a CSV file is essentially a plain-text spreadsheet.
For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame — a table, basically — then let you do things like:
- Calculate statistics and answer questions about the data, like:
  1. What's the average, median, max, or min of each column?
  2. Does column A correlate with column B?
- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
- Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
- Store the cleaned, transformed data back into a CSV, another file format, or a database
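The workflow above can be sketched end-to-end on a tiny, made-up dataset (the column names and the output filename here are purely illustrative):

```python
import pandas as pd

# A tiny, hypothetical dataset standing in for data loaded from a CSV
df = pd.DataFrame({
    'a': [1.0, 2.0, None, 4.0],
    'b': [10.0, 20.0, 30.0, 40.0],
})

# Calculate statistics
avg = df['a'].mean()          # average of column a, ignoring the missing value
corr = df['a'].corr(df['b'])  # correlation between columns a and b

# Clean: drop rows with missing values, then filter rows by a criterion
cleaned = df.dropna()
filtered = cleaned[cleaned['b'] > 10]

# Store the result back into a CSV
filtered.to_csv('cleaned.csv')
```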
Before you jump into modeling or complex visualizations, you need a good understanding of your dataset's nature, and pandas is the best library for that.
How does pandas fit into the data science toolkit?
Not only is the pandas library a central component of the data science toolkit, but it is also used in conjunction with other libraries in that collection.
Pandas First Steps (the install step below is only needed when using an offline IDE such as Jupyter, not Google Colab.)
Install and import
Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using the following command:
In [ ]:
!pip install pandas # This is when you are using Pandas in local system not on google colab.
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (1.1.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas) (2018.9)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas) (1.19.5)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
The `!` at the beginning runs the cell as if it were in a terminal.
Since pandas is used so much, we usually import it under a shorter name:
In [ ]:
import pandas as pd
Now to the basic components of pandas.
Core components of pandas: Series and DataFrames
The two primary components of pandas are the `Series` and the `DataFrame`.
A `Series` is essentially a column, and a `DataFrame` is a multi-dimensional table made up of a collection of Series.
DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.
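As a minimal sketch of that overlap, the same `mean` and `fillna` calls work on both structures:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])                 # a Series: a single column of data
df = pd.DataFrame({'x': [1.0, None, 3.0],
                   'y': [4.0, 5.0, None]})      # a DataFrame: a collection of Series

# The same operations apply to both structures
print(s.mean())        # mean of the Series, skipping the missing value
print(df.mean())       # column-wise means of the DataFrame

s_filled = s.fillna(0)    # replace the missing value in the Series with 0
df_filled = df.fillna(0)  # same operation, applied to every column
```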
You’ll see how these components work when we start working with data below.
Creating DataFrames from scratch
Creating DataFrames right in Python is good to know and quite useful when testing new methods and functions you find in the pandas docs.
There are many ways to create a DataFrame from scratch, but a great option is to just use a simple `dict`.
Let’s say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like:
In [ ]:
import pandas as pd
In [ ]:
data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}
And then pass it to the pandas DataFrame constructor:
In [ ]:
purchases = pd.DataFrame(data)
purchases
Out[ ]:
 | apples | oranges |
---|---|---|
0 | 3 | 0 |
1 | 2 | 3 |
2 | 0 | 7 |
3 | 1 | 2 |
How did that work?
Each (key, value) item in `data` corresponds to a column in the resulting DataFrame.
The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.
Let’s have customer names as our index:
In [ ]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases
Out[ ]:
 | apples | oranges |
---|---|---|
June | 3 | 0 |
Robert | 2 | 3 |
Lily | 0 | 7 |
David | 1 | 2 |
So now we could locate a customer’s order by using their name:
In [ ]:
purchases.loc['June']
Out[ ]:
apples     3
oranges    0
Name: June, dtype: int64
There’s more on locating and extracting data from the DataFrame later, but now you should be able to create a DataFrame with any random data to learn on.
Let’s move on to some quick methods for creating DataFrames from various other sources.
How to read in data
It’s quite simple to load data from various file formats into a DataFrame. In the following examples we’ll keep using our apples and oranges data, but this time it’s coming from various files.
In [ ]:
from google.colab import drive  # Note: to understand this part better, please check the session document
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Reading data from CSVs
With CSV files all you need is a single line to load in the data:
In [ ]:
df = pd.read_csv('/content/drive/MyDrive/Dataset/purchases.csv')
df
Out[ ]:
 | Unnamed: 0 | apples | oranges |
---|---|---|---|
0 | June | 3 | 0 |
1 | Robert | 2 | 3 |
2 | Lily | 0 | 7 |
3 | David | 1 | 2 |
CSVs don’t have indexes like our DataFrames, so all we need to do is designate the `index_col` when reading:
In [ ]:
df = pd.read_csv('/content/drive/MyDrive/Dataset/purchases.csv', index_col=0)
df
Out[ ]:
 | apples | oranges |
---|---|---|
June | 3 | 0 |
Robert | 2 | 3 |
Lily | 0 | 7 |
David | 1 | 2 |
Here we’re setting the index to be column zero.
You’ll find that most CSVs won’t have an index column, so usually you don’t have to worry about this step.
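If you do forget `index_col` at read time, you can get the same result afterwards with `set_index`. A minimal sketch, using an in-memory DataFrame to simulate what `read_csv` returns without `index_col`:

```python
import pandas as pd

# Simulate a CSV read without index_col: the names land in a plain column
df = pd.DataFrame({'name': ['June', 'Robert', 'Lily', 'David'],
                   'apples': [3, 2, 0, 1],
                   'oranges': [0, 3, 7, 2]})

# Promote the 'name' column to be the index, like index_col=0 would have done
df = df.set_index('name')
print(df.loc['June'])  # rows are now addressable by name
```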
Most important DataFrame operations
DataFrames possess hundreds of methods and other operations that are crucial to any analysis. As a beginner, you should know the operations that perform simple transformations of your data and those that provide fundamental statistical analysis.
Let’s load in the IMDB movies dataset to begin:
In [ ]:
movies_df = pd.read_csv("/content/drive/MyDrive/Dataset/Movie_data.csv", index_col="Title")
We’re loading this dataset from a CSV and designating the movie titles to be our index.
Viewing your data
The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with `.head()`:
In [ ]:
movies_df.head()
Out[ ]:
 | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
---|---|---|---|---|---|---|---|---|---|---|---|
Title | | | | | | | | | | | |
Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced … | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S… | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te… | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa… | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag… | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar… | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
Sing | 4 | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea… | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma… | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
Suicide Squad | 5 | Action,Adventure,Fantasy | A secret government agency recruits some of th… | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D… | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
`.head()` outputs the first five rows of your DataFrame by default, but we could also pass a number: `movies_df.head(10)` would output the top ten rows, for example.
To see the last five rows, use `.tail()`. `.tail()` also accepts a number; in this case we print the bottom two rows:
In [ ]:
movies_df.tail(2)
Out[ ]:
 | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
---|---|---|---|---|---|---|---|---|---|---|---|
Title | | | | | | | | | | | |
Search Party | 999 | Adventure,Comedy | A pair of friends embark on a mission to reuni… | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh… | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
Nine Lives | 1000 | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins… | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch… | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
Typically when we load in a dataset, we like to view the first five or so rows to see what’s under the hood. Here we can see the names of each column, the index, and examples of values in each row.
You’ll notice that the index in our DataFrame is the Title column, which you can tell by how the word Title is slightly lower than the rest of the columns.
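You can also confirm this programmatically: the index keeps the column's name. A sketch on a small stand-in frame (the data here is made up, only the first two titles from the dataset are reused):

```python
import pandas as pd

# Stand-in for a CSV read with index_col="Title"
df = pd.DataFrame({'Rank': [1, 2]},
                  index=pd.Index(['Guardians of the Galaxy', 'Prometheus'],
                                 name='Title'))

print(df.index.name)         # 'Title' — the index keeps the column's name
print(df.columns.tolist())   # ['Rank'] — Title is no longer a regular column
```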
Getting info about your data
`.info()` should be one of the very first commands you run after loading your data:
In [ ]:
movies_df.info()
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Rank                1000 non-null   int64
 1   Genre               1000 non-null   object
 2   Description         1000 non-null   object
 3   Director            1000 non-null   object
 4   Actors              1000 non-null   object
 5   Year                1000 non-null   int64
 6   Runtime (Minutes)   1000 non-null   int64
 7   Rating              1000 non-null   float64
 8   Votes               1000 non-null   int64
 9   Revenue (Millions)  872 non-null    float64
 10  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB
`.info()` provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, the type of data in each column, and how much memory your DataFrame is using.
Notice in our movies dataset we have some obvious missing values in the `Revenue` and `Metascore` columns. We’ll look at how to handle those in a bit.
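A quick way to quantify those gaps before deciding what to do about them is `isnull().sum()`, which counts missing values per column. Sketched here on a small stand-in for the movies data (the values are made up):

```python
import pandas as pd
import numpy as np

# Stand-in for the movies data: two columns with some missing entries
df = pd.DataFrame({'Revenue (Millions)': [333.13, np.nan, 138.12, np.nan],
                   'Metascore': [76.0, 65.0, np.nan, 40.0]})

missing = df.isnull().sum()  # count of missing values per column
print(missing)
```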
Another fast and useful attribute is `.shape`, which outputs just a tuple of (rows, columns):
In [ ]:
movies_df.shape
Out[ ]:
(1000, 11)
Note that `.shape` has no parentheses; it is a simple tuple of format (rows, columns). So we have 1000 rows and 11 columns in our movies DataFrame.
You’ll be reaching for `.shape` a lot when cleaning and transforming data. For example, you might filter some rows based on some criteria and then want to know quickly how many rows were removed.
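That before-and-after check can be sketched like this, with made-up ratings standing in for real data:

```python
import pandas as pd

df = pd.DataFrame({'Rating': [8.1, 7.0, 7.3, 6.2, 5.3]})

before = df.shape[0]               # number of rows before filtering
high = df[df['Rating'] >= 7.0]     # keep only well-rated rows
after = high.shape[0]              # number of rows after filtering

removed = before - after
print(removed)  # how many rows the filter dropped
```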