Introduction to Data Analysis in
¶

Python

using
¶

Pandas

What is Pandas?

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

What is Pandas for?

  • pandas is the right tool when working with tabular data
    • such as data stored in spreadsheets or databases
  • pandas will help you to
    • explore
    • clean
    • transform
    • merge
    • process
  • In pandas, a data table is called a DataFrame
    Pandas DataFrame
  • In pandas, a column of a data table is called a Series
    • A DataFrame is a collection of Series

With pandas you can

  1. Import many types of data, including
    • CSV files
    • Tab or other types of delimited files
    • Excel (xls, xlsx) files
    • Stata files
  1. Open files directly from a website
  2. Merge, select, join data
  3. Perform statistical analyses
  4. Create plots of your data

and much more.

  • pandas is imported as
    import pandas as pd
    
    I suggest you follow this convention, which will make using other people's code and snippets easier.
In [1]:
# Let's import pandas and some other basic packages we will use 
from __future__ import division
%matplotlib inline
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

How to create a Pandas `DataFrame`?

Import and Export Data¶

Pandas

Import Existing Data¶

pandas supports the integration with many file formats or data sources

  • csv
  • excel
  • sql
  • json
  • parquet
  • Stata
  • HTML

To import data in a specific format you use the function with the prefix read_*, e.g.,

  • csv: pd.read_csv
  • excel: pd.read_excel
  • stata: pd.read_stata
  • html: pd.read_html

Export Data¶

To export data in a specific format you use the function with the prefix to_*, e.g.,

  • csv: pd.to_csv
  • excel: pd.to_excel
  • stata: pd.to_stata

Create a New DataFrame¶

Pandas DataFrame
  • To create a new Series
    my_new_series = pd.Series(a_list_with_data, 
                            name='Name of Column in data table')
    
  • To create a new DataFrame
    df = pd.DataFrame(a_list_of_series_or_raw_data,
                    columns=list_with_column_names,
                    index=list_of_indices_if_you_need)
    

Select subset of data¶

Subset data

Select subset of columns¶

Subset data
  • To select 1 column and get a Series
    df['name of column']
    
    or
    df.name_of_column
    
Careful: The second method only works if there are no spaces in the name of the variable/row.
  • To select a subset of columns and get a DataFrame
    df[list_of_columns]
    
  • In particular, to get a DataFrame from a single column
    df[['name of column']]
    
Note: Notice the small difference when selecting a unique column

Select subset of rows¶

Subset data
  • To select specific rows, you need to pass the index of the rows you want and use df.iloc.

    E.g., if you want the observations in index [0,2,5,9], then

    df.iloc[[0,2,5,9]]
    
  • To select rows that satisfy a condition, e.g., variable_1>0, use df.loc

    df.loc[df['variable_1']>0]
    
  • or more generally

    df.loc[condition to be satisfied by data in df]
    

Select subset of rows and columns¶

Subset data
  • To select jointly a subset of rows that satisfy a condition and some of the variables
    df.loc[condition to be satisfied by data in df, list_of_columns]
    
    e.g., select rows that satisfy variable_1>0 and only the subset of columns ['variable_1', 'variable_2']
    df.loc[df['variable_1']>0, ['variable_1', 'variable_2']]
    

Plot Data¶

Plotting

pandas can easily create various types of plots based on the data

  • Plot all variables, where $x$ is the index
    df.plot()
    
  • Plot only one specific variable (against the index)
    df['variable'].plot()
    
    or
    df.variable_name.plot()
    
  • Plot variable X against variable Y
    df.plot.scatter(x='Variable X', y='Variable Y')
    
  • Other plot types:
    • Box plot: df.plot.box()
    • Area plot: df.plot.area()
    • (Vertical) Bar plot: df.plot.bar()
    • (Horizontal) Bar plot: df.plot.barh()
    • Density plot: df.plot.density()
    • Historam: df.plot.hist()
    • KDE plot: df.plot.kde()
    • Pie Chart: df.plot.pie()

pandas uses matplotlib, so we can pass options or use axes and figures as we learned before¶

We can also pass pandas dataframes to seaborn, statsmodels, and many other packages¶

Create New Columns/Variables¶

New Columns

From Other Columns/Variables¶

  • To create a new column in a dataframe from other columns in the dataframe just perform any operations on the existing variables
  • All operations are performed across all rows
  • E.g., to create the new variable $Y$ defined as $Y=2\cdot X^2-\ln(Z)$, where $X$ and $Z$ are existing variables
    df['Y'] = 2 * df['X']**2 - df['Z'].apply(np.log)
    

From a list or numpy.array¶

  • To create a new column in the dataframe from a list or array, just define the new column to be equal to the list or array
    df['new variable'] = list_or_array
    
Careful: The list or array must have the same dimension as the dataframe. Check it using df.shape and array.shape or len(list)

Applying Functions to Dataframe¶

New Columns
  • pandas also let's you apply a function to a column or to the whole dataset using the apply function.
  • To apply a function to the whole dataset:
    df.apply(my_function, axis=1)
    
    This is useful when the function uses data from various columns
  • To apply a function to only a column:
    df['My Column'].apply(my_function, axis=1)
    
    or
    df.my_column.apply(my_function, axis=1)
    

Create Statistics for Columns in Dataframe¶

groupby

The pandas DataFrame has various properties that compute statistics based on columns in the dataframe

  • mean
  • min
  • max
  • std
  • median
  • quantile

These statistics can be computed across rows or columns

  • Statistic across rows (for all columns): df.statistic() or df.statistic(axis=0)
  • Statistics across columns: df.statistic(axis=1)

The special method describe generates various summary statistics

df.describe()
Note: You can apply these statistics to a single column or a subset of columns.

Apply Functions or Create Aggregate Statistics by Groups of Rows¶

groupby
  • To generate statistics or apply functions my groups in the data we use the groupby method
  • To generate a dataframe that contains the statistics for each group defined by columns named Variable1 and Variable2
df.groupby(['Variable1', 'Variable2']).statistic()
  • To apply a function to all rows in each group and generate a dataframe with the group level result
    df.groupby(['Variable1', 'Variable2']).apply(my_function)
    

Reshape DataFrame¶

meltpivot

Many times you may need to reshape you data¶

  • To be used in other python commands
  • For further processing

Reshape DataFrame From Wide To Long¶

melt

Original dataframe has units (countries, households, individuals) in rows and variables or yearly observations in columns¶

New dataframe has unit $\times$ variable/year in each row and values in columns¶

Useful commands¶

  • pd.wide_to_long(df, stubnames, i, j, sep='', suffix='\\d+')
  • pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)
  • df.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)
  • df.stack(level=- 1, dropna=True)
melt
melt

Reshape DataFrame From Long To Wide¶

pivot

Original dataframe has unit $\times$ variable/year in each row and values in columns¶

New dataframe has units (countries, households, individuals) in rows and variables or yearly observations in columns¶

Useful commands¶

  • pd.pivot(data, index=None, columns=None, values=None)
  • df.pivot(index=None, columns=None, values=None)
  • df.unstack(level=- 1, fill_value=None)
melt
melt

Combine Data from Multiple DataFrames¶

concatenate

Concatenating Dataframes¶

concatenate

pandas let's you easily concatenate various Series or DataFrames to create a new DataFrame¶

pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

pd.concat([df1, df2, df3])
concatenate
pd.concat([df1, df4], axis=1)
concatenate
pd.concat([df1, df4], axis=1, join='inner')
concatenate

Merging/Joining Dataframes¶

concatenate

pandas let's you easily merge/join Series or DataFrames to create a new DataFrame¶

  • pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
  • df.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
pd.merge([left, right])

or

left.merge(right)

or

right.merge(left)
concatenate
pd.merge([left, right])

or

pd.merge([left, right], on=["key1", "key2"])
concatenate
pd.merge([left, right], how="left", on=["key1", "key2"])

or

left.merge(right, how="left", on=["key1", "key2"])

or

right.merge(left, how="left", on=["key1", "key2"])
concatenate
pd.merge([left, right], how="right", on=["key1", "key2"])

or

left.merge(right, how="right", on=["key1", "key2"])

or

right.merge(left, how="right", on=["key1", "key2"])
concatenate
pd.merge([left, right], how="outer", on=["key1", "key2"])

or

left.merge(right, how="outer", on=["key1", "key2"])

or

right.merge(left, how="outer", on=["key1", "key2"])
concatenate

Many Other Options and Possibilities¶

  • Working with Time Series
  • Working with Text

Examples
¶

Example - Import data¶

Let's import the table of countries' ISO codes from Wikipedia¶

In [2]:
# Import display options for showing websites
from IPython.display import IFrame
url = 'https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes'
IFrame(url, width=800, height=400)
Out[2]:
In [3]:
isocodes = pd.read_html(url, encoding='utf-8')[0]
isocodes
Out[3]:
ISO 3166[1] Unnamed: 1_level_0 Unnamed: 2_level_0 ISO 3166-1[2] ISO 3166-2[3] Unnamed: 7_level_0
Country name[5] Official state name[6] Sovereignty[6][7][8] Alpha-2 code[5] Alpha-3 code[5] Numeric code[5] Subdivision code links[3] Internet ccTLD[9]
0 Afghanistan The Islamic Republic of Afghanistan UN member state .mw-parser-output .monospaced{font-family:mono... AFG 004 ISO 3166-2:AF .af
1 Åland Islands Åland Finland AX ALA 248 ISO 3166-2:AX .ax
2 Albania The Republic of Albania UN member state AL ALB 008 ISO 3166-2:AL .al
3 Algeria The People's Democratic Republic of Algeria UN member state DZ DZA 012 ISO 3166-2:DZ .dz
4 American Samoa The Territory of American Samoa United States AS ASM 016 ISO 3166-2:AS .as
... ... ... ... ... ... ... ... ...
266 Wallis and Futuna The Territory of the Wallis and Futuna Islands France WF WLF 876 ISO 3166-2:WF .wf
267 Western Sahara [ah] The Sahrawi Arab Democratic Republic Disputed [ai] EH ESH 732 ISO 3166-2:EH [aj]
268 Yemen The Republic of Yemen UN member state YE YEM 887 ISO 3166-2:YE .ye
269 Zambia The Republic of Zambia UN member state ZM ZMB 894 ISO 3166-2:ZM .zm
270 Zimbabwe The Republic of Zimbabwe UN member state ZW ZWE 716 ISO 3166-2:ZW .zw

271 rows × 8 columns

Not perfect, but we can correct it and make it look nice¶

In [4]:
isocodes.columns
Out[4]:
MultiIndex([(       'ISO 3166[1]',           'Country name[5]'),
            ('Unnamed: 1_level_0',    'Official state name[6]'),
            ('Unnamed: 2_level_0',      'Sovereignty[6][7][8]'),
            (     'ISO 3166-1[2]',           'Alpha-2 code[5]'),
            (     'ISO 3166-1[2]',           'Alpha-3 code[5]'),
            (     'ISO 3166-1[2]',           'Numeric code[5]'),
            (     'ISO 3166-2[3]', 'Subdivision code links[3]'),
            ('Unnamed: 7_level_0',         'Internet ccTLD[9]')],
           )

First, let's drop the first column index¶

In [5]:
isocodes = isocodes.droplevel(0, axis=1)
isocodes.head()
Out[5]:
Country name[5] Official state name[6] Sovereignty[6][7][8] Alpha-2 code[5] Alpha-3 code[5] Numeric code[5] Subdivision code links[3] Internet ccTLD[9]
0 Afghanistan The Islamic Republic of Afghanistan UN member state .mw-parser-output .monospaced{font-family:mono... AFG 004 ISO 3166-2:AF .af
1 Åland Islands Åland Finland AX ALA 248 ISO 3166-2:AX .ax
2 Albania The Republic of Albania UN member state AL ALB 008 ISO 3166-2:AL .al
3 Algeria The People's Democratic Republic of Algeria UN member state DZ DZA 012 ISO 3166-2:DZ .dz
4 American Samoa The Territory of American Samoa United States AS ASM 016 ISO 3166-2:AS .as

Second, let's correct column names¶

In [6]:
mycols = isocodes.columns
mycols = [c[:c.find('[')] for c in mycols]
mycols
Out[6]:
['Country name',
 'Official state name',
 'Sovereignty',
 'Alpha-2 code',
 'Alpha-3 code',
 'Numeric code',
 'Subdivision code links',
 'Internet ccTLD']
In [7]:
isocodes.columns = mycols
isocodes.head()
Out[7]:
Country name Official state name Sovereignty Alpha-2 code Alpha-3 code Numeric code Subdivision code links Internet ccTLD
0 Afghanistan The Islamic Republic of Afghanistan UN member state .mw-parser-output .monospaced{font-family:mono... AFG 004 ISO 3166-2:AF .af
1 Åland Islands Åland Finland AX ALA 248 ISO 3166-2:AX .ax
2 Albania The Republic of Albania UN member state AL ALB 008 ISO 3166-2:AL .al
3 Algeria The People's Democratic Republic of Algeria UN member state DZ DZA 012 ISO 3166-2:DZ .dz
4 American Samoa The Territory of American Samoa United States AS ASM 016 ISO 3166-2:AS .as

Third, let's correct Alpha-2 code using Subdivision code links¶

In [8]:
isocodes['Alpha-2 code original'] = isocodes['Alpha-2 code']
isocodes['Alpha-2 code'] = isocodes['Subdivision code links'].apply(lambda x: x[x.find(':')+1:])
isocodes.head()
Out[8]:
Country name Official state name Sovereignty Alpha-2 code Alpha-3 code Numeric code Subdivision code links Internet ccTLD Alpha-2 code original
0 Afghanistan The Islamic Republic of Afghanistan UN member state AF AFG 004 ISO 3166-2:AF .af .mw-parser-output .monospaced{font-family:mono...
1 Åland Islands Åland Finland AX ALA 248 ISO 3166-2:AX .ax AX
2 Albania The Republic of Albania UN member state AL ALB 008 ISO 3166-2:AL .al AL
3 Algeria The People's Democratic Republic of Algeria UN member state DZ DZA 012 ISO 3166-2:DZ .dz DZ
4 American Samoa The Territory of American Samoa United States AS ASM 016 ISO 3166-2:AS .as AS

Now, let's import the table of countries' GDP per capita (PPP) from Wikipedia_per_capita)¶

In [9]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita'
IFrame(url, width=800, height=400)
Out[9]:
In [10]:
gdppc_wiki = pd.read_html(url, encoding='utf-8')[1]
gdppc_wiki
Out[10]:
Country/Territory UN Region IMF[5][6] World Bank[7] CIA[8]
Country/Territory UN Region Estimate Year Estimate Year Estimate Year
0 Monaco * Europe — — 190513 2019 115700 2015
1 Liechtenstein * Europe — — 180367 2018 139100 2009
2 Luxembourg * Europe 140694 2022 118360 2020 110300 2020
3 Singapore * Asia 131580 2022 98526 2020 93400 2020
4 Ireland * Europe 124596 2022 93612 2020 89700 2020
... ... ... ... ... ... ... ... ...
225 Somalia * Africa 1322 2022 875.2 2020 800 2020
226 DR Congo * Africa 1316 2022 1131 2020 1098 2019
227 Central African Republic * Africa 1102 2022 979.6 2020 945 2019
228 South Sudan * Africa 928 2022 1235 2015 1600 2017
229 Burundi * Africa 856 2022 771.2 2020 700 2020

230 rows × 8 columns

Again we need to clean the data a little bit¶

In [11]:
gdppc_wiki.columns = ['Country/Territory', 'UN Region', 'gdppc_IMF', 'year_IMF',
                      'gdppc_WB', 'year_WB', 'gdppc_CIA', 'year_CIA']
gdppc_wiki.head()
Out[11]:
Country/Territory UN Region gdppc_IMF year_IMF gdppc_WB year_WB gdppc_CIA year_CIA
0 Monaco * Europe — — 190513 2019 115700 2015
1 Liechtenstein * Europe — — 180367 2018 139100 2009
2 Luxembourg * Europe 140694 2022 118360 2020 110300 2020
3 Singapore * Asia 131580 2022 98526 2020 93400 2020
4 Ireland * Europe 124596 2022 93612 2020 89700 2020

Let's eliminate the * in the country names¶

In [12]:
gdppc_wiki['country_name'] = gdppc_wiki['Country/Territory'].str.replace('*', '', regex=True).str.strip()
gdppc_wiki.head()
Out[12]:
Country/Territory UN Region gdppc_IMF year_IMF gdppc_WB year_WB gdppc_CIA year_CIA country_name
0 Monaco * Europe — — 190513 2019 115700 2015 Monaco
1 Liechtenstein * Europe — — 180367 2018 139100 2009 Liechtenstein
2 Luxembourg * Europe 140694 2022 118360 2020 110300 2020 Luxembourg
3 Singapore * Asia 131580 2022 98526 2020 93400 2020 Singapore
4 Ireland * Europe 124596 2022 93612 2020 89700 2020 Ireland

Let's make sure years and GDPpc columns are treated as numbers¶

In [13]:
gdppc_wiki.dtypes
Out[13]:
Country/Territory    object
UN Region            object
gdppc_IMF            object
year_IMF             object
gdppc_WB             object
year_WB              object
gdppc_CIA            object
year_CIA             object
country_name         object
dtype: object
In [14]:
for c in gdppc_wiki.columns[2:-1]:
    gdppc_wiki[c] = pd.to_numeric(gdppc_wiki[c].str.replace('—', 'nan'), errors='coerce')
    if c.startswith('year'):
        gdppc_wiki[c] = gdppc_wiki[c].astype('Int64')
In [15]:
gdppc_wiki.dtypes
Out[15]:
Country/Territory     object
UN Region             object
gdppc_IMF            float64
year_IMF               Int64
gdppc_WB             float64
year_WB                Int64
gdppc_CIA            float64
year_CIA               Int64
country_name          object
dtype: object

Let's try to merge the data from both dataframes¶

In [16]:
isocodes.head(2)
Out[16]:
Country name Official state name Sovereignty Alpha-2 code Alpha-3 code Numeric code Subdivision code links Internet ccTLD Alpha-2 code original
0 Afghanistan The Islamic Republic of Afghanistan UN member state AF AFG 004 ISO 3166-2:AF .af .mw-parser-output .monospaced{font-family:mono...
1 Åland Islands Åland Finland AX ALA 248 ISO 3166-2:AX .ax AX
In [17]:
gdppc_wiki.head(1)
Out[17]:
Country/Territory UN Region gdppc_IMF year_IMF gdppc_WB year_WB gdppc_CIA year_CIA country_name
0 Monaco * Europe NaN <NA> 190513.0 2019 115700.0 2015 Monaco

The only common information in both dataframes is the country's name, so let's merge using the corrected country_name and Country name¶

In [18]:
merged = isocodes.merge(gdppc_wiki, left_on='Country name', right_on='country_name')
merged
Out[18]:
Country name Official state name Sovereignty Alpha-2 code Alpha-3 code Numeric code Subdivision code links Internet ccTLD Alpha-2 code original Country/Territory UN Region gdppc_IMF year_IMF gdppc_WB year_WB gdppc_CIA year_CIA country_name
0 Afghanistan The Islamic Republic of Afghanistan UN member state AF AFG 004 ISO 3166-2:AF .af .mw-parser-output .monospaced{font-family:mono... Afghanistan * Asia 2456.0 2020 2088.0 2020 2065.0 2019 Afghanistan
1 Albania The Republic of Albania UN member state AL ALB 008 ISO 3166-2:AL .al AL Albania * Europe 17383.0 2022 13818.0 2020 13965.0 2019 Albania
2 Algeria The People's Democratic Republic of Algeria UN member state DZ DZA 012 ISO 3166-2:DZ .dz DZ Algeria * Africa 16509.0 2022 11268.0 2020 11511.0 2019 Algeria
3 American Samoa The Territory of American Samoa United States AS ASM 016 ISO 3166-2:AS .as AS American Samoa * Oceania NaN <NA> NaN <NA> 11200.0 2016 American Samoa
4 Andorra The Principality of Andorra UN member state AD AND 020 ISO 3166-2:AD .ad AD Andorra * Europe 63600.0 2022 NaN <NA> 49900.0 2015 Andorra
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
169 Vanuatu The Republic of Vanuatu UN member state VU VUT 548 ISO 3166-2:VU .vu VU Vanuatu * Oceania 2851.0 2022 2915.0 2020 3153.0 2019 Vanuatu
170 Wallis and Futuna The Territory of the Wallis and Futuna Islands France WF WLF 876 ISO 3166-2:WF .wf WF Wallis and Futuna * Oceania NaN <NA> NaN <NA> 3800.0 2004 Wallis and Futuna
171 Yemen The Republic of Yemen UN member state YE YEM 887 ISO 3166-2:YE .ye YE Yemen * Asia 2078.0 2022 3689.0 2013 2500.0 2017 Yemen
172 Zambia The Republic of Zambia UN member state ZM ZMB 894 ISO 3166-2:ZM .zm ZM Zambia * Africa 3776.0 2022 3450.0 2020 3470.0 2019 Zambia
173 Zimbabwe The Republic of Zimbabwe UN member state ZW ZWE 716 ISO 3166-2:ZW .zw ZW Zimbabwe * Africa 2523.0 2022 2895.0 2020 2836.0 2019 Zimbabwe

174 rows × 18 columns

Careful: Remember that merge will create a complete dataset only for the countries that have common names in both.
In [19]:
merged.shape
Out[19]:
(174, 18)

What are the not common country names in both dataframes?¶

In [20]:
isocodes_names = set(isocodes['Country name'])
gdppc_wiki_names = set(gdppc_wiki['country_name'])
In [21]:
isocodes_names.difference(gdppc_wiki_names)
Out[21]:
{'Antarctica\u200a[a]',
 'Australia\u200a[b]',
 'Bahamas (the)',
 'Bolivia (Plurinational State of)',
 'Bonaire\xa0Sint Eustatius\xa0Saba',
 'Bouvet Island',
 'British Indian Ocean Territory (the)',
 'British Virgin Islands – See Virgin Islands (British).',
 'Brunei Darussalam\u200a[e]',
 'Burma – See Myanmar.',
 'Cabo Verde\u200a[f]',
 'Cape Verde – See Cabo Verde.',
 'Caribbean Netherlands – See Bonaire, Sint Eustatius and Saba.',
 'Cayman Islands (the)',
 'Central African Republic (the)',
 'China, The Republic of – See Taiwan (Province of China).',
 'Christmas Island',
 'Cocos (Keeling) Islands (the)',
 'Comoros (the)',
 'Congo (the Democratic Republic of the)',
 'Congo (the)\u200a[g]',
 'Cook Islands (the)',
 'Czechia\u200a[i]',
 "Côte d'Ivoire\u200a[h]",
 "Democratic People's Republic of Korea – See Korea, The Democratic People's Republic of.",
 'Democratic Republic of the Congo – See Congo, The Democratic Republic of the.',
 'Dominican Republic (the)',
 'East Timor – See Timor-Leste.',
 'Eswatini\u200a[j]',
 'Falkland Islands (the) [Malvinas]\u200a[k]',
 'Faroe Islands (the)',
 'France\u200a[l]',
 'French Guiana',
 'French Southern Territories (the)\u200a[m]',
 'Gambia (the)',
 'Great Britain – See United Kingdom, The.',
 'Guadeloupe',
 'Heard Island and McDonald Islands',
 'Holy See (the)\u200a[n]',
 'Iran (Islamic Republic of)',
 "Ivory Coast – See Côte d'Ivoire.",
 'Jan Mayen – See Svalbard and Jan Mayen.',
 "Korea (the Democratic People's Republic of)\u200a[o]",
 'Korea (the Republic of)\u200a[p]',
 "Lao People's Democratic Republic (the)\u200a[q]",
 'Macao\u200a[r]',
 'Marshall Islands (the)',
 'Martinique',
 'Mayotte',
 'Micronesia (Federated States of)',
 'Moldova (the Republic of)',
 'Myanmar\u200a[t]',
 'Netherlands (the)',
 'Niger (the)',
 'Norfolk Island',
 "North Korea – See Korea, The Democratic People's Republic of.",
 'North Macedonia\u200a[s]',
 'Northern Mariana Islands (the)',
 'Palestine, State of',
 "People's Republic of China – See China.",
 'Philippines (the)',
 'Pitcairn\u200a[u]',
 'Republic of China – See Taiwan (Province of China).',
 'Republic of Korea – See Korea, The Republic of.',
 'Republic of the Congo – See Congo, The.',
 'Russian Federation (the)\u200a[v]',
 'Réunion',
 'Saba – See Bonaire, Sint Eustatius and Saba.',
 'Sahrawi Arab Democratic Republic – See Western Sahara.',
 'Saint Barthélemy',
 'Saint Helena\xa0Ascension Island\xa0Tristan da Cunha',
 'Saint Martin (French part)',
 'Sao Tome and Principe',
 'Sint Eustatius – See Bonaire, Sint Eustatius and Saba.',
 'Sint Maarten (Dutch part)',
 'South Georgia and the South Sandwich Islands',
 'South Korea – See Korea, The Republic of.',
 'Sudan (the)',
 'Svalbard\xa0Jan Mayen',
 'Syrian Arab Republic (the)\u200a[x]',
 'Taiwan (Province of China)\u200a[y]',
 'Tanzania, the United Republic of',
 'Timor-Leste\u200a[aa]',
 'Turks and Caicos Islands (the)',
 'Türkiye',
 'United Arab Emirates (the)',
 'United Kingdom of Great Britain and Northern Ireland (the)',
 'United States Minor Outlying Islands (the)\u200a[ac]',
 'United States Virgin Islands – See Virgin Islands (U.S.).',
 'United States of America (the)',
 'Vatican City – See Holy See, The.',
 'Venezuela (Bolivarian Republic of)',
 'Viet Nam\u200a[ae]',
 'Virgin Islands (British)\u200a[af]',
 'Virgin Islands (U.S.)\u200a[ag]',
 'Western Sahara\u200a[ah]',
 'Åland Islands'}
In [22]:
gdppc_wiki_names.difference(isocodes_names)
Out[22]:
{'Australia',
 'Bahamas',
 'Bolivia',
 'British Virgin Islands',
 'Brunei',
 'Cape Verde',
 'Cayman Islands',
 'Central African Republic',
 'Comoros',
 'Congo',
 'Cook Islands',
 'Czech Republic',
 'DR Congo',
 'Dominican Republic',
 'East Timor',
 'Eswatini',
 'European Union',
 'Falkland Islands',
 'Faroe Islands',
 'France',
 'Gambia',
 'Iran',
 'Ivory Coast',
 'Kosovo',
 'Laos',
 'Macau',
 'Marshall Islands',
 'Micronesia',
 'Moldova',
 'Myanmar',
 'Netherlands',
 'Niger',
 'North Korea',
 'North Macedonia',
 'Northern Mariana Islands',
 'Palestine',
 'Philippines',
 'Russia',
 'Saint Helena, Ascension and Tristan da Cunha',
 'Saint Martin',
 'Sint Maarten',
 'South Korea',
 'Sudan',
 'Syria',
 'São Tomé and Príncipe',
 'Taiwan',
 'Tanzania',
 'Turkey',
 'Turks and Caicos Islands',
 'U.S. Virgin Islands',
 'United Arab Emirates',
 'United Kingdom',
 'United States',
 'Venezuela',
 'Vietnam',
 'World'}

Clearly to create the full dataset, we'd need to standardize the country names.¶

This is a major reason to use ISO CODES!¶

We'll learn methods to do this in another lecture, for now let's use the merged subset¶

Simple Plots¶

In [23]:
# Set the size of the figure and get a figure and axis object
fig, ax = plt.subplots(figsize=(10,6))
merged.gdppc_CIA.plot.kde(ax=ax, label='CIA')
merged.gdppc_IMF.plot.kde(ax=ax, label='IMF')
merged.gdppc_WB.plot.kde(ax=ax, label='WB')
ax.legend()
Out[23]:
<matplotlib.legend.Legend at 0x18c3ff790>
In [24]:
# Set the size of the figure and get a figure and axis object
fig, ax = plt.subplots(figsize=(10,6))
merged.gdppc_CIA.plot.hist(ax=ax, label='CIA')
merged.gdppc_IMF.plot.hist(ax=ax, label='IMF', alpha=0.6)
merged.gdppc_WB.plot.hist(ax=ax, label='WB', alpha=0.3)
ax.legend()
Out[24]:
<matplotlib.legend.Legend at 0x18c348310>
In [25]:
# Set the size of the figure and get a figure and axis object
fig, ax = plt.subplots(figsize=(10,6))
merged.plot.scatter(x='gdppc_WB', y='gdppc_CIA', ax=ax, label='WB-CIA', c='r')
merged.plot.scatter(x='gdppc_WB', y='gdppc_IMF', ax=ax, label='WB-IMF', c='b')
ax.set_xlabel('World Bank')
ax.set_ylabel('Other Source')
ax.legend(loc='lower right')
Out[25]:
<matplotlib.legend.Legend at 0x18f402460>

Example - Create Data¶

Let's create a DataFrame with some artificial data for some countries.

  1. Create Series of countries
  2. Create Series of data for each country
  3. Create DataFrame
In [26]:
countries = pd.Series(['Colombia', 'Turkey', 'United States', 'Germany', 'Chile'], name='country')
countries
Out[26]:
0         Colombia
1           Turkey
2    United States
3          Germany
4            Chile
Name: country, dtype: object
In [27]:
print('\n', 'There are ', countries.shape[0], 'countries in this series.')
 There are  5 countries in this series.

We can apply a function on the data using the apply method.¶

E.g., let's find the length of each country name using the len function.

In [28]:
countries.apply(len)
Out[28]:
0     8
1     6
2    13
3     7
4     5
Name: country, dtype: int64

Let's create the data using some of the functions we already learned.

In [29]:
np.random.seed(123456)
data = pd.Series(np.random.normal(size=(countries.shape)), name='noise')
data
Out[29]:
0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4    1.212112
Name: noise, dtype: float64

We can perform certain computation using some of the properties of the pd.Series:¶

  • mean
  • std
  • max
In [30]:
print('\n', 'The average in this sample is ', data.mean())
print('\n', 'The average in this sample is ', "{:.2f}".format(data.mean()))
print('\n', 'The maximum in this sample is ', "{:.2f}".format(data.max()))
print('\n', 'The standard deviation in this sample is ', "{:.2f}".format(data.std()))
 The average in this sample is  -0.24926597871826645

 The average in this sample is  -0.25

 The maximum in this sample is  1.21

 The standard deviation in this sample is  1.12

Here we have used the mean() function of the series to compute its mean. There are many other properties/functions for these series including std(), shape, count(), max(), min(), etc. You can access these by writing series.name_of_function_or_property. To see what functions are available you can hit tab after writing series..

We can transform the data using the apply method¶

In [31]:
data.apply(np.exp)
Out[31]:
0    1.598575
1    0.753623
2    0.221118
3    0.321219
4    3.360575
Name: noise, dtype: float64

Let's create a pd.DataFrame using these two series.¶

Method 1¶

In [32]:
df = pd.DataFrame([countries, data])
df
Out[32]:
0 1 2 3 4
country Colombia Turkey United States Germany Chile
noise 0.469112 -0.282863 -1.509059 -1.135632 1.212112

Not exactly what we'd like, but don't worry, we can just transpose it so it has each country with its data in a row.

In [33]:
df = df.T
df
Out[33]:
country noise
0 Colombia 0.469112
1 Turkey -0.282863
2 United States -1.509059
3 Germany -1.135632
4 Chile 1.212112

Method 2¶

In [34]:
df = pd.concat([countries, data], axis=1)
df
Out[34]:
country noise
0 Colombia 0.469112
1 Turkey -0.282863
2 United States -1.509059
3 Germany -1.135632
4 Chile 1.212112

Method 3¶

In [35]:
df = pd.DataFrame({'country':countries,
                   'noise':data})
df
Out[35]:
country noise
0 Colombia 0.469112
1 Turkey -0.282863
2 United States -1.509059
3 Germany -1.135632
4 Chile 1.212112

Adding more variables/rows¶

Now let us add some more data to this dataframe. This is done easily by defining a new column.

Let's create the square of noise, create the sum of noise and its square, and get the length of the country's name.

In [36]:
df['noise_sq'] = df.noise**2
df['noise and its square'] = df.noise + df.noise_sq
df['name length'] = df.country.apply(len)
df
Out[36]:
country noise noise_sq noise and its square name length
0 Colombia 0.469112 0.220066 0.689179 8
1 Turkey -0.282863 0.080012 -0.202852 6
2 United States -1.509059 2.277258 0.768199 13
3 Germany -1.135632 1.289661 0.154029 7
4 Chile 1.212112 1.469216 2.681328 5

This shows some of the ways in which you can create new data.

Especially useful is the apply method, which applies a function to the series.

You can also apply a function to the whole dataframe, which is useful if you want to perform computations using various columns.

Let's see some other ways in which we can interact with dataframes.

First, let's identify some observations, e.g., all countries in the South America.¶

Let's create a list of South American countries¶

In [37]:
south_america = ['Colombia', 'Chile']

Let's create a new dummy variable that identifies countries in South America¶

using apply and our south_america list¶

In [38]:
df['South America Logical'] = df.country.apply(lambda x: x in south_america)
df
Out[38]:
country noise noise_sq noise and its square name length South America Logical
0 Colombia 0.469112 0.220066 0.689179 8 True
1 Turkey -0.282863 0.080012 -0.202852 6 False
2 United States -1.509059 2.277258 0.768199 13 False
3 Germany -1.135632 1.289661 0.154029 7 False
4 Chile 1.212112 1.469216 2.681328 5 True

Notice the new column takes on logical values, i.e., True or False¶

More useful to have numerical values, where 1:True and 0:False¶

Method 1: Dictionary and map¶

In [39]:
mydict = {True:1,
          False:0}
df['South America Dict'] = df['South America Logical'].map(mydict)
df
Out[39]:
country noise noise_sq noise and its square name length South America Logical South America Dict
0 Colombia 0.469112 0.220066 0.689179 8 True 1
1 Turkey -0.282863 0.080012 -0.202852 6 False 0
2 United States -1.509059 2.277258 0.768199 13 False 0
3 Germany -1.135632 1.289661 0.154029 7 False 0
4 Chile 1.212112 1.469216 2.681328 5 True 1

Method 2: Change type¶

In [40]:
df['South America'] = df.country.apply(lambda x: x in south_america).astype(int)
df
Out[40]:
country noise noise_sq noise and its square name length South America Logical South America Dict South America
0 Colombia 0.469112 0.220066 0.689179 8 True 1 1
1 Turkey -0.282863 0.080012 -0.202852 6 False 0 0
2 United States -1.509059 2.277258 0.768199 13 False 0 0
3 Germany -1.135632 1.289661 0.154029 7 False 0 0
4 Chile 1.212112 1.469216 2.681328 5 True 1 1

Exercises
¶

Exercise 0: Create a new folder and notebook called Intro-Data-Analysis-Pandas-Exercises where you will perform all the exercises below. Make sure to copy any code you need from this notebook to that one. You will use that folder to create a new GitHub repo with the code, html, and slides as usual.
Exercise 1: Create a new dataframe pop with population data downloaded from Wikipedia. Make sure to clean the data so it can be used further.
Exercise 2: Merge the isocodes and pop dataframes.
Exercise 3: Merge the dataframes we have created so far to have a unique dataframe that has ISO codes, GDP per capita, and population data.
Exercise 4: Use the os package to create folders to export data and figures. Since you will be using the names of these folders a lot, save their names in variables called path, pathout, and pathgraphs, where path = './data/', pathout = './data/', and pathgraphs = './graphs/'
Exercise 5: Save the dataframe created in Exercise 3 as a CSV, XLSX, and Stata file into the pathout folder. Use a variable called filename = 'Wiki_Data' so you can use similar code to save all file types. Notice only the filetype will change.
Exercise 6: Create plots showing the relation between GDP per capita and Population. Create all 4 types of possible regression plots and save them as PNG, PDF, and JPG files. Make sure to save them in the folder you created for graphs
Exercise 7: Create plots showing the relation between GDP per capita and Population Growth. Create all 4 types of possible regression plots and save them as PNG, PDF, and JPG files. Make sure to save them in the folder you created for graphs
Exercise 8: Using the notebook create slides for presenting your work and results. Once you have your slides, create a new public repo, publish it, and make sure to create a READ.ME file that show links to the notebook, html, and slides. Also, create the gh-pages branch to have a working slides webpage.

Notebook written by Ömer Özak for his students in Economics at Southern Methodist University. Feel free to use, distribute, or contribute.