
Data Visualization: Python Pandas part 1


Concepts Covered
- Understand the basics of pandas and its core operations.
- The pandas library: DataFrames, reading CSV files, and basic operations on file data.

The pandas package is the most important tool at the disposal of Analysts working in Python today. The powerful machine learning and glamorous visualization tools may get all the attention, but pandas is the backbone of most data projects.

[pandas] is derived from the term "panel data", a term for data sets that include observations over multiple time periods for the same individuals. — [Wikipedia]

What’s Pandas for?

Pandas has so many uses that it might make sense to list the things it can’t do instead of what it can do.

This tool is essentially your data’s home. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.

CSV: CSV stands for comma-separated values, so a CSV file is essentially a plain-text spreadsheet.

For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame — a table, basically — then let you do things like:

  • Calculate statistics and answer questions about the data, like: What's the average, median, max, or min of each column? Does column A correlate with column B?
  • Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
  • Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
  • Store the cleaned, transformed data back into a CSV, another file format, or a database

Before you jump into modeling or complex visualizations, you need a good understanding of the nature of your dataset, and pandas is the best library for that.

How does pandas fit into the data science toolkit?

The pandas library is not only a central component of the data science toolkit; it is also used in conjunction with the other libraries in that collection.

Pandas First Steps (installation is only needed when working in a local IDE such as Jupyter)


Install and import

Pandas is an easy package to install. Open your terminal program (for Mac users) or command line (for PC users) and install it with the following command:

In [ ]:

!pip install pandas  # Only needed on a local system; Google Colab ships with pandas preinstalled.
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (1.1.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas) (2018.9)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas) (1.19.5)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)

The ! at the beginning runs cells as if they were in a terminal.

Since pandas is used so often, we usually import it under a shorter name:

In [ ]:

import pandas as pd 

Now to the basic components of pandas.

Core components of pandas: Series and DataFrames

The primary two components of pandas are the Series and DataFrame.

A Series is essentially a column, and a DataFrame is a two-dimensional table made up of a collection of Series.

DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.
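A quick sketch of that overlap, on illustrative toy data: the same methods, such as .fillna() and .mean(), work on both a Series and a DataFrame.

```python
import pandas as pd

# A Series is one labeled column; a DataFrame is a table of Series.
s = pd.Series([1.0, None, 3.0], name="apples")
df = pd.DataFrame({"apples": [1.0, None, 3.0], "oranges": [2.0, 4.0, None]})

print(s.mean())        # NaN is skipped: (1.0 + 3.0) / 2 = 2.0
print(df.mean())       # the same method on a DataFrame: one mean per column
print(s.fillna(0))     # fillna works identically on both
print(df.fillna(0))
```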

You’ll see how these components work when we start working with data below.

Creating DataFrames from scratch

Creating DataFrames right in Python is good to know and quite useful when testing new methods and functions you find in the pandas docs.

There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.

Let’s say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like:

In [ ]:

import pandas as pd

In [ ]:

data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

And then pass it to the pandas DataFrame constructor:

In [ ]:

purchases = pd.DataFrame(data)

purchases

Out[ ]:

   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2

How did that work?

Each (key, value) item in data corresponds to a column in the resulting DataFrame.

The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.

Let’s have customer names as our index:

In [ ]:

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

Out[ ]:

        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2

So now we could locate a customer’s order by using their name:

In [ ]:

purchases.loc['June']

Out[ ]:

apples     3
oranges    0
Name: June, dtype: int64

There’s more on locating and extracting data from the DataFrame later, but now you should be able to create a DataFrame with any random data to learn on.
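As a brief preview of that locating and extracting, here is a minimal sketch rebuilding the purchases DataFrame from above; the .iloc accessor is an addition beyond what the text shows, selecting by integer position rather than by label.

```python
import pandas as pd

data = {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

print(purchases.loc['Lily'])            # row selected by its label
print(purchases.iloc[2])                # the same row, by position
print(purchases.loc['June', 'apples'])  # a single cell: 3
```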

Let’s move on to some quick methods for creating DataFrames from various other sources.

How to read in data

It’s quite simple to load data from various file formats into a DataFrame. In the following examples we’ll keep using our apples and oranges data, but this time it’s coming from various files.

In [ ]:

from google.colab import drive # Note: (To understand this part better please check the session document)
drive.mount('/content/drive')
Mounted at /content/drive

In [ ]:

from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Reading data from CSVs

With CSV files all you need is a single line to load in the data:

In [ ]:

df = pd.read_csv('/content/drive/MyDrive/Dataset/purchases.csv')

df

Out[ ]:

  Unnamed: 0  apples  oranges
0       June       3        0
1     Robert       2        3
2       Lily       0        7
3      David       1        2

CSV files don’t have indexes the way our DataFrames do, so we need to designate the index_col when reading:

In [ ]:

df = pd.read_csv('/content/drive/MyDrive/Dataset/purchases.csv', index_col=0)

df

Out[ ]:

        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2

Here we’re setting the index to be column zero.

You’ll find that most CSVs won’t ever have an index column and so usually you don’t have to worry about this step.
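The round trip in the other direction (storing cleaned data back into a CSV, as the earlier bullet list mentioned) can be sketched as follows; an in-memory StringIO buffer stands in for a real file path here.

```python
import io
import pandas as pd

purchases = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['June', 'Robert', 'Lily', 'David'])

# to_csv writes the index as the first (unnamed) column by default;
# a StringIO buffer stands in for a file on disk.
buffer = io.StringIO()
purchases.to_csv(buffer)
buffer.seek(0)

# index_col=0 on the way back in restores that column as the index:
df = pd.read_csv(buffer, index_col=0)
print(df.loc['Lily', 'oranges'])   # 7
```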

Most important DataFrame operations

DataFrames possess hundreds of methods and other operations that are crucial to any analysis. As a beginner, you should know the operations that perform simple transformations of your data and those that provide fundamental statistical analysis.

Let’s load in the IMDB movies dataset to begin:

In [ ]:

movies_df = pd.read_csv("/content/drive/MyDrive/Dataset/Movie_data.csv", index_col="Title")


We’re loading this dataset from a CSV and designating the movie titles to be our index.

Viewing your data

The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with .head():

In [ ]:

movies_df.head()

Out[ ]:

Title | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced … | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S… | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0
Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te… | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa… | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0
Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag… | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar… | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0
Sing | 4 | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea… | Christophe Lourdelet | Matthew McConaughey, Reese Witherspoon, Seth Ma… | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0
Suicide Squad | 5 | Action,Adventure,Fantasy | A secret government agency recruits some of th… | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D… | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0

.head() outputs the first five rows of your DataFrame by default, but we could also pass a number as well: movies_df.head(10) would output the top ten rows, for example.

To see the last five rows, use .tail(). It also accepts a number; in this case we print the bottom two rows:

In [ ]:

movies_df.tail(2)

Out[ ]:

Title | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
Search Party | 999 | Adventure,Comedy | A pair of friends embark on a mission to reuni… | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch, Sh… | 2014 | 93 | 5.6 | 4881 | NaN | 22.0
Nine Lives | 1000 | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins… | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell, Ch… | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0

Typically when we load in a dataset, we like to view the first five or so rows to see what’s under the hood. Here we can see the names of each column, the index, and examples of values in each row.

You’ll notice that the index in our DataFrame is the Title column, which you can tell by how the word Title is slightly lower than the rest of the columns.

Getting info about your data

.info() should be one of the very first commands you run after loading your data:

In [ ]:

movies_df.info()
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Genre               1000 non-null   object 
 2   Description         1000 non-null   object 
 3   Director            1000 non-null   object 
 4   Actors              1000 non-null   object 
 5   Year                1000 non-null   int64  
 6   Runtime (Minutes)   1000 non-null   int64  
 7   Rating              1000 non-null   float64
 8   Votes               1000 non-null   int64  
 9   Revenue (Millions)  872 non-null    float64
 10  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB

.info() provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

Notice in our movies dataset we have some obvious missing values in the Revenue and Metascore columns. We’ll look at how to handle those in a bit.
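As a brief preview of that missing-value handling, here is a sketch on toy stand-in data (the real Movie_data.csv lives on Drive, so the values below are illustrative): count nulls per column, then either drop or impute them.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies data, with two missing Revenue values:
df = pd.DataFrame({
    'Rating':  [8.1, 7.0, 7.3, 7.2],
    'Revenue': [333.13, np.nan, 138.12, np.nan],
})

print(df.isnull().sum())   # missing values per column: Revenue has 2
print(df.dropna())         # one option: drop rows with any missing value
filled = df['Revenue'].fillna(df['Revenue'].mean())
print(filled)              # another: impute with the column mean
```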

Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):

In [ ]:

movies_df.shape

Out[ ]:

(1000, 11)

Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So we have 1000 rows and 11 columns in our movies DataFrame.

You’ll be reaching for .shape a lot when cleaning and transforming data. For example, you might filter rows by some criteria and then want to know quickly how many rows were removed.
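That before-and-after check can be sketched as follows, using hypothetical toy data in place of the movies DataFrame:

```python
import pandas as pd

# Hypothetical toy data standing in for the movies DataFrame:
df = pd.DataFrame({'Year':   [2014, 2012, 2016, 2016, 2016],
                   'Rating': [8.1, 7.0, 7.3, 7.2, 6.2]})

before = df.shape
recent = df[df['Year'] >= 2016]            # filter rows by a criterion
print(before, '->', recent.shape)          # (5, 2) -> (3, 2)
print(before[0] - recent.shape[0], 'rows removed')
```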

