
    Data Visualization: Python Pandas part 1


    Concepts Covered
    - Understand the basics of pandas and its core operations.
    - The pandas library, DataFrames, reading CSV files, and basic pandas data operations.

    The pandas package is the most important tool at the disposal of Analysts working in Python today. The powerful machine learning and glamorous visualization tools may get all the attention, but pandas is the backbone of most data projects.

    [pandas] is derived from the term "panel data", a term for data sets that include observations over multiple time periods for the same individuals. — [Wikipedia]

    What’s Pandas for?

    Pandas has so many uses that it might make sense to list the things it can’t do instead of what it can do.

    This tool is essentially your data’s home. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.

    CSV: CSV stands for "comma-separated values", so a CSV file is essentially a plain-text spreadsheet.

    For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame — a table, basically — then let you do things like:

    • Calculate statistics and answer questions about the data, such as: What's the average, median, max, or min of each column? Does column A correlate with column B?
    • Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
    • Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
    • Store the cleaned, transformed data back into a CSV, another file format, or a database
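The steps above can be sketched with a tiny made-up dataset (the names and values here are illustrative, not from a real file):

```python
import pandas as pd

# A hypothetical tiny dataset standing in for a CSV you might load.
df = pd.DataFrame({
    "apples": [3, 2, None, 1],
    "oranges": [0, 3, 7, 2],
})

# Calculate statistics: average and max of a column.
mean_oranges = df["oranges"].mean()
max_oranges = df["oranges"].max()

# Clean: drop rows with missing values, then filter rows by a criterion.
cleaned = df.dropna()
big_orders = cleaned[cleaned["oranges"] > 2]

# Store the result back to a CSV (path is illustrative):
# cleaned.to_csv("cleaned_purchases.csv")
```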

    Before you jump into modeling or complex visualizations, you need a good understanding of the nature of your dataset, and pandas is the best library for doing that.

    How does pandas fit into the data science toolkit?

    Not only is the pandas library a central component of the data science toolkit, but it is also used in conjunction with the other libraries in that collection.
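For instance, pandas is built on top of NumPy, and its objects convert to and from NumPy arrays directly; a minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 4, 9])

# A Series converts to a plain NumPy array...
arr = s.to_numpy()

# ...and NumPy functions apply element-wise to pandas objects.
roots = np.sqrt(s)
```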

    Pandas First Steps (the installation step below is only needed when working in a local, offline IDE such as Jupyter; it is not required on Google Colab)

    Install and import

    Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using the following command:

    In [ ]:

    !pip install pandas  # Only needed on a local system; Google Colab comes with pandas preinstalled.
    
    Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (1.1.5)
    Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas) (2.8.2)
    Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas) (2018.9)
    Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas) (1.19.5)
    Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
    

    The ! at the beginning runs cells as if they were in a terminal.

    To import pandas we usually import it with a shorter name since it’s used so much:

    In [ ]:

    import pandas as pd 
    

    Now to the basic components of pandas.

    Core components of pandas: Series and DataFrames

    The primary two components of pandas are the Series and DataFrame.

    A Series is essentially a column, and a DataFrame is a two-dimensional table made up of a collection of Series.

    DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.
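A quick sketch of that overlap (the small DataFrame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"apples": [3, None, 1], "oranges": [0, 3, None]})
col = df["apples"]  # selecting one column yields a Series

# The same operations apply to both objects:
df_filled = df.fillna(0)    # fill nulls across the whole table
col_filled = col.fillna(0)  # or on a single column
col_mean = col.mean()       # NaN is skipped: (3 + 1) / 2 = 2.0
```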

    You’ll see how these components work when we start working with data below.

    Creating DataFrames from scratch

    Creating DataFrames right in Python is good to know and quite useful when testing new methods and functions you find in the pandas docs.

    There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.

    Let’s say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like:

    In [ ]:

    import pandas as pd
    

    In [ ]:

    data = {
        'apples': [3, 2, 0, 1], 
        'oranges': [0, 3, 7, 2]
    }
    

    And then pass it to the pandas DataFrame constructor:

    In [ ]:

    purchases = pd.DataFrame(data)
    
    purchases
    

    Out[ ]:

       apples  oranges
    0       3        0
    1       2        3
    2       0        7
    3       1        2

    How did that work?

    Each (key, value) item in data corresponds to a column in the resulting DataFrame.

    The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.

    Let’s have customer names as our index:

    In [ ]:

    purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
    
    purchases
    

    Out[ ]:

            apples  oranges
    June         3        0
    Robert       2        3
    Lily         0        7
    David        1        2

    So now we could locate a customer’s order by using their name:

    In [ ]:

    purchases.loc['June']
    

    Out[ ]:

    apples     3
    oranges    0
    Name: June, dtype: int64

    There’s more on locating and extracting data from the DataFrame later, but now you should be able to create a DataFrame with any random data to learn on.
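One convenient way to generate such throwaway data is NumPy's random number generator; a minimal sketch (the column names are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # a seed makes the "random" data reproducible

# 5 rows x 3 columns of random integers between 0 and 9 to practice on.
practice_df = pd.DataFrame(
    rng.integers(0, 10, size=(5, 3)),
    columns=["a", "b", "c"],
)
```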

    Let’s move on to some quick methods for creating DataFrames from various other sources.

    How to read in data

    It’s quite simple to load data from various file formats into a DataFrame. In the following examples we’ll keep using our apples and oranges data, but this time it’s coming from various files.

    In [ ]:

    from google.colab import drive # Note: (To understand this part better please check the session document)
    drive.mount('/content/drive')
    
    Mounted at /content/drive
    

    In [ ]:

    from google.colab import drive
    drive.mount('/content/drive')
    
    Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
    

    Reading data from CSVs

    With CSV files all you need is a single line to load in the data:

    In [ ]:

    df = pd.read_csv('/content/drive/MyDrive/Dataset/purchases.csv')
    
    df
    

    Out[ ]:

      Unnamed: 0  apples  oranges
    0       June       3        0
    1     Robert       2        3
    2       Lily       0        7
    3      David       1        2

    CSVs don’t have indexes like our DataFrames, so all we need to do is just designate the index_col when reading:

    In [ ]:

    df = pd.read_csv('/content/drive/MyDrive/Dataset/purchases.csv', index_col=0)
    
    df
    

    Out[ ]:

            apples  oranges
    June         3        0
    Robert       2        3
    Lily         0        7
    David        1        2

    Here we’re setting the index to be column zero.

    You’ll find that most CSVs don’t have an index column, so usually you won’t have to worry about this step.
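To see where that stray "Unnamed: 0" column comes from, here is a round-trip sketch using an in-memory buffer instead of a file on disk (the data mirrors our purchases example):

```python
import io
import pandas as pd

purchases = pd.DataFrame(
    {"apples": [3, 2], "oranges": [0, 3]},
    index=["June", "Robert"],
)

# Saving with the default index=True writes the index as an unnamed
# first column, which reappears as "Unnamed: 0" on a plain read.
buffer = io.StringIO()
purchases.to_csv(buffer)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

# Reading with index_col=0 restores the original index instead.
restored = pd.read_csv(io.StringIO(buffer.getvalue()), index_col=0)
```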

    Most important DataFrame operations

    DataFrames possess hundreds of methods and other operations that are crucial to any analysis. As a beginner, you should know the operations that perform simple transformations of your data and those that provide fundamental statistical analysis.

    Let’s load in the IMDB movies dataset to begin:

    In [ ]:

    movies_df = pd.read_csv("/content/drive/MyDrive/Dataset/Movie_data.csv", index_col="Title")
    

    We’re loading this dataset from a CSV and designating the movie titles to be our index.

    Viewing your data

    The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with .head():

    In [ ]:

    movies_df.head()
    

    Out[ ]:

    Title                   | Rank | Genre                    | Description                                     | Director             | Actors                                           | Year | Runtime (Minutes) | Rating | Votes  | Revenue (Millions) | Metascore
    Guardians of the Galaxy | 1    | Action,Adventure,Sci-Fi  | A group of intergalactic criminals are forced … | James Gunn           | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S…  | 2014 | 121               | 8.1    | 757074 | 333.13             | 76.0
    Prometheus              | 2    | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te… | Ridley Scott         | Noomi Rapace, Logan Marshall-Green, Michael Fa…  | 2012 | 124               | 7.0    | 485820 | 126.46             | 65.0
    Split                   | 3    | Horror,Thriller          | Three girls are kidnapped by a man with a diag… | M. Night Shyamalan   | James McAvoy, Anya Taylor-Joy, Haley Lu Richar…  | 2016 | 117               | 7.3    | 157606 | 138.12             | 62.0
    Sing                    | 4    | Animation,Comedy,Family  | In a city of humanoid animals, a hustling thea… | Christophe Lourdelet | Matthew McConaughey, Reese Witherspoon, Seth Ma… | 2016 | 108               | 7.2    | 60545  | 270.32             | 59.0
    Suicide Squad           | 5    | Action,Adventure,Fantasy | A secret government agency recruits some of th… | David Ayer           | Will Smith, Jared Leto, Margot Robbie, Viola D…  | 2016 | 123               | 6.2    | 393727 | 325.02             | 40.0

    .head() outputs the first five rows of your DataFrame by default, but we could also pass a number as well: movies_df.head(10) would output the top ten rows, for example.

    To see the last five rows, use .tail(). It also accepts a number; in this case we print the bottom two rows:

    In [ ]:

    movies_df.tail(2)
    

    Out[ ]:

    Title        | Rank | Genre                 | Description                                     | Director         | Actors                                           | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
    Search Party | 999  | Adventure,Comedy      | A pair of friends embark on a mission to reuni… | Scot Armstrong   | Adam Pally, T.J. Miller, Thomas Middleditch, Sh… | 2014 | 93                | 5.6    | 4881  | NaN                | 22.0
    Nine Lives   | 1000 | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins… | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell, Ch… | 2016 | 87                | 5.3    | 12435 | 19.64              | 11.0

    Typically when we load in a dataset, we like to view the first five or so rows to see what’s under the hood. Here we can see the names of each column, the index, and examples of values in each row.

    You’ll notice that the index in our DataFrame is the Title column, which you can tell by how the word Title is slightly lower than the rest of the columns.
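With titles as the index, rows can be selected by name using .loc, just as we did with customer names earlier. A small stand-in for the movies DataFrame (the two rows are taken from the head() output above):

```python
import pandas as pd

# A tiny stand-in for movies_df, with titles as a named index.
movies = pd.DataFrame(
    {"Year": [2014, 2012], "Rating": [8.1, 7.0]},
    index=pd.Index(["Guardians of the Galaxy", "Prometheus"], name="Title"),
)

guardians = movies.loc["Guardians of the Galaxy"]  # select a row by its title
```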

    Getting info about your data

    .info() should be one of the very first commands you run after loading your data:

    In [ ]:

    movies_df.info()
    
    Index: 1000 entries, Guardians of the Galaxy to Nine Lives
    Data columns (total 11 columns):
     #   Column              Non-Null Count  Dtype  
    ---  ------              --------------  -----  
     0   Rank                1000 non-null   int64  
     1   Genre               1000 non-null   object 
     2   Description         1000 non-null   object 
     3   Director            1000 non-null   object 
     4   Actors              1000 non-null   object 
     5   Year                1000 non-null   int64  
     6   Runtime (Minutes)   1000 non-null   int64  
     7   Rating              1000 non-null   float64
     8   Votes               1000 non-null   int64  
     9   Revenue (Millions)  872 non-null    float64
     10  Metascore           936 non-null    float64
    dtypes: float64(3), int64(4), object(4)
    memory usage: 93.8+ KB
    

    .info() provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

    Notice in our movies dataset we have some obvious missing values in the Revenue and Metascore columns. We’ll look at how to handle those in a bit.
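A quick way to count those missing values per column is .isnull().sum(); a sketch with a made-up frame that has the same kind of gaps:

```python
import numpy as np
import pandas as pd

# Illustrative data with NaN gaps like the Revenue/Metascore columns.
df = pd.DataFrame({
    "Rating": [8.1, 7.0, 7.3],
    "Revenue (Millions)": [333.13, np.nan, 138.12],
    "Metascore": [76.0, 65.0, np.nan],
})

missing_per_column = df.isnull().sum()  # a Series of NaN counts per column
```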

    Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):

    In [ ]:

    movies_df.shape
    

    Out[ ]:

    (1000, 11)

    Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So we have 1000 rows and 11 columns in our movies DataFrame.

    You’ll turn to .shape a lot when cleaning and transforming data. For example, you might filter some rows based on some criteria and then want to know quickly how many rows were removed.
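A sketch of that workflow, using a few made-up ratings:

```python
import pandas as pd

df = pd.DataFrame({
    "Rating": [8.1, 7.0, 5.3, 6.2],
    "Year": [2014, 2012, 2016, 2016],
})

rows_before = df.shape[0]
highly_rated = df[df["Rating"] >= 7.0]            # filter by a criterion
rows_removed = rows_before - highly_rated.shape[0]  # how many rows we dropped
```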

