import pandas as pd


# Let's import pandas and some other basic packages we will use 
from __future__ import division
%matplotlib inline
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Import Existing Data

my_new_series = pd.Series(a_list_with_data, 
                        name='Name of Column in data table')

df = pd.DataFrame(a_list_of_series_or_raw_data,
                columns=list_with_column_names,
                index=list_of_indices_if_you_need)

df['name of column']

df[list_of_columns]

df.iloc[[0,2,5,9]]

df.loc[df['variable_1']>0]

df.loc[condition]  # condition: a boolean Series, e.g. df['variable_1'] > 0

df.loc[condition, list_of_columns]
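
To make the selection templates above concrete, here is a minimal sketch on toy data. The column names and values are made up for illustration and are not part of the lecture data.

```python
import pandas as pd

# Toy data; column names and values are illustrative assumptions
df = pd.DataFrame({'variable_1': [-1.0, 0.5, 2.0, -3.0],
                   'variable_2': ['a', 'b', 'c', 'd'],
                   'variable_3': [10, 20, 30, 40]})

# A condition is a boolean Series with one True/False per row
condition = df['variable_1'] > 0

# Select rows only
rows = df.loc[condition]

# Select rows and a list of columns at the same time
rows_cols = df.loc[condition, ['variable_2', 'variable_3']]

# Combine conditions with & (and) or | (or); wrap each one in parentheses
both = df.loc[(df['variable_1'] > 0) & (df['variable_3'] < 30)]

print(rows_cols)
```

Note that `df.loc` selects by label and by boolean mask, while `df.iloc` selects by integer position.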

df.plot()

df['variable'].plot()

df.plot.scatter(x='Variable X', y='Variable Y')

df['Y'] = 2 * df['X']**2 - df['Z'].apply(np.log)

df['new variable'] = list_or_array

df.apply(my_function, axis=1)

df['My Column'].apply(my_function)  # a Series has no rows/columns to choose from, so no axis argument
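
A short sketch of the difference between applying a function to a whole dataframe and to a single column. The data and `my_function` below are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative assumption)
df = pd.DataFrame({'X': [1.0, 2.0, 3.0],
                   'Z': [1.0, np.e, np.e**2]})

def my_function(row):
    # With axis=1 the function receives one row at a time, as a Series
    return row['X'] + row['Z']

# Row-wise application on the dataframe
row_sums = df.apply(my_function, axis=1)

# On a single column the function receives one value at a time,
# so no axis argument is needed
log_z = df['Z'].apply(np.log)

print(row_sums)
print(log_z)
```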

df.describe()

df.groupby(['Variable1', 'Variable2']).mean()  # or another statistic: .sum(), .median(), .std(), ...

df.groupby(['Variable1', 'Variable2']).apply(my_function)
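
Here is a minimal sketch of both group-by patterns on toy data. The column names, the values, and the helper `value_range` are assumptions made for this example.

```python
import pandas as pd

# Toy data (illustrative assumption)
df = pd.DataFrame({'region': ['Americas', 'Americas', 'Europe', 'Europe'],
                   'year': [2020, 2020, 2020, 2021],
                   'gdppc': [10.0, 20.0, 30.0, 50.0]})

# Built-in statistic by group: mean of gdppc for each (region, year) pair
means = df.groupby(['region', 'year'])['gdppc'].mean()

def value_range(group):
    # Receives all the values of one group and returns a single number
    return group.max() - group.min()

# Custom function by group
ranges = df.groupby('region')['gdppc'].apply(value_range)

print(means)
print(ranges)
```

The result of a group-by with several keys has a `MultiIndex`; call `.reset_index()` on it to turn the keys back into regular columns.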

Original dataframe has units (countries, households, individuals) in rows and variables or yearly observations in columns

Reshape DataFrame From Long To Wide

Original dataframe has unit $\times$ variable/year in each row and values in columns

Combine Data from Multiple DataFrames

Concatenating Dataframes

pandas lets you easily concatenate various `Series` or `DataFrames` to create a new `DataFrame`

pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

pd.concat([df1, df2, df3])

pd.concat([df1, df4], axis=1)

pd.concat([df1, df4], axis=1, join='inner')
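
The three calls above can be tried on small dataframes. The frames below are made up for illustration; only their names follow the templates.

```python
import pandas as pd

# Toy dataframes (illustrative assumption)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]}, index=[2, 3])
df4 = pd.DataFrame({'C': [9, 10]}, index=[1, 2])

# Stack rows (axis=0 is the default)
stacked = pd.concat([df1, df2])

# Put columns side by side, keeping the union of the indices;
# entries missing in one dataframe become NaN
side_outer = pd.concat([df1, df4], axis=1)

# Put columns side by side, keeping only indices present in both
side_inner = pd.concat([df1, df4], axis=1, join='inner')

print(stacked)
print(side_outer)
print(side_inner)
```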

Merging/Joining Dataframes

pandas lets you easily merge/join `Series` or `DataFrames` to create a new `DataFrame`

pd.merge(left, right)

pd.merge(left, right, how="left", on=["key1", "key2"])

pd.merge(left, right, how="right", on=["key1", "key2"])

pd.merge(left, right, how="outer", on=["key1", "key2"])
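
To see what the `how` option does, here is a small sketch with two toy dataframes. The keys and values are assumptions for this example.

```python
import pandas as pd

# Toy dataframes sharing the key columns key1 and key2 (illustrative assumption)
left = pd.DataFrame({'key1': ['a', 'b', 'c'],
                     'key2': [1, 2, 3],
                     'x': [10, 20, 30]})
right = pd.DataFrame({'key1': ['b', 'c', 'd'],
                      'key2': [2, 3, 4],
                      'y': [0.2, 0.3, 0.4]})

# Default is how="inner": keep only keys found in both dataframes
inner = pd.merge(left, right, on=['key1', 'key2'])

# Keep all keys of the left dataframe; y is NaN where there is no match
left_join = pd.merge(left, right, how='left', on=['key1', 'key2'])

# Keep all keys of the right dataframe; x is NaN where there is no match
right_join = pd.merge(left, right, how='right', on=['key1', 'key2'])

# Keep keys found in either dataframe
outer = pd.merge(left, right, how='outer', on=['key1', 'key2'])

print(len(inner), len(left_join), len(right_join), len(outer))
```

Note that `pd.merge` takes the two dataframes as separate arguments, not as a list, which is the difference from `pd.concat`.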

Many Other Options and Possibilities


# Import display options for showing websites
from IPython.display import IFrame
url = 'https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes'
IFrame(url, width=800, height=400)


isocodes = pd.read_html(url, encoding='utf-8')[0]
isocodes


isocodes.columns

MultiIndex([(       'ISO 3166[1]',           'Country name[5]'),
            ('Unnamed: 1_level_0',    'Official state name[6]'),
            ('Unnamed: 2_level_0',      'Sovereignty[6][7][8]'),
            (     'ISO 3166-1[2]',           'Alpha-2 code[5]'),
            (     'ISO 3166-1[2]',           'Alpha-3 code[5]'),
            (     'ISO 3166-1[2]',           'Numeric code[5]'),
            (     'ISO 3166-2[3]', 'Subdivision code links[3]'),
            ('Unnamed: 7_level_0',         'Internet ccTLD[9]')],
           )


isocodes = isocodes.droplevel(0, axis=1)
isocodes.head()


mycols = isocodes.columns
mycols = [c.split('[')[0] for c in mycols]  # keep the text before the footnote marker, if any
mycols

['Country name',
 'Official state name',
 'Sovereignty',
 'Alpha-2 code',
 'Alpha-3 code',
 'Numeric code',
 'Subdivision code links',
 'Internet ccTLD']


isocodes.columns = mycols
isocodes.head()


isocodes['Alpha-2 code original'] = isocodes['Alpha-2 code']
isocodes['Alpha-2 code'] = isocodes['Subdivision code links'].apply(lambda x: x[x.find(':')+1:])
isocodes.head()


url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita'
IFrame(url, width=800, height=400)


gdppc_wiki = pd.read_html(url, encoding='utf-8')[1]
gdppc_wiki


gdppc_wiki.columns = ['Country/Territory', 'UN Region', 'gdppc_IMF', 'year_IMF',
                      'gdppc_WB', 'year_WB', 'gdppc_CIA', 'year_CIA']
gdppc_wiki.head()


gdppc_wiki['country_name'] = gdppc_wiki['Country/Territory'].str.replace('*', '', regex=False).str.strip()  # '*' alone is not a valid regex, so match it literally
gdppc_wiki.head()


gdppc_wiki.dtypes

Country/Territory    object
UN Region            object
gdppc_IMF            object
year_IMF             object
gdppc_WB             object
year_WB              object
gdppc_CIA            object
year_CIA             object
country_name         object
dtype: object


for c in gdppc_wiki.columns[2:-1]:
    gdppc_wiki[c] = pd.to_numeric(gdppc_wiki[c].str.replace('—', 'nan'), errors='coerce')
    if c.startswith('year'):
        gdppc_wiki[c] = gdppc_wiki[c].astype('Int64')


gdppc_wiki.dtypes

Country/Territory     object
UN Region             object
gdppc_IMF            float64
year_IMF               Int64
gdppc_WB             float64
year_WB                Int64
gdppc_CIA            float64
year_CIA               Int64
country_name          object
dtype: object


isocodes.head(2)


gdppc_wiki.head(1)


merged = isocodes.merge(gdppc_wiki, left_on='Country name', right_on='country_name')
merged


merged.shape

(174, 18)


isocodes_names = set(isocodes['Country name'])
gdppc_wiki_names = set(gdppc_wiki['country_name'])


isocodes_names.difference(gdppc_wiki_names)

{'Antarctica\u200a[a]',
 'Australia\u200a[b]',
 'Bahamas (the)',
 'Bolivia (Plurinational State of)',
 'Bonaire\xa0Sint Eustatius\xa0Saba',
 'Bouvet Island',
 'British Indian Ocean Territory (the)',
 'British Virgin Islands – See Virgin Islands (British).',
 'Brunei Darussalam\u200a[e]',
 'Burma – See Myanmar.',
 'Cabo Verde\u200a[f]',
 'Cape Verde – See Cabo Verde.',
 'Caribbean Netherlands – See Bonaire, Sint Eustatius and Saba.',
 'Cayman Islands (the)',
 'Central African Republic (the)',
 'China, The Republic of – See Taiwan (Province of China).',
 'Christmas Island',
 'Cocos (Keeling) Islands (the)',
 'Comoros (the)',
 'Congo (the Democratic Republic of the)',
 'Congo (the)\u200a[g]',
 'Cook Islands (the)',
 'Czechia\u200a[i]',
 "Côte d'Ivoire\u200a[h]",
 "Democratic People's Republic of Korea – See Korea, The Democratic People's Republic of.",
 'Democratic Republic of the Congo – See Congo, The Democratic Republic of the.',
 'Dominican Republic (the)',
 'East Timor – See Timor-Leste.',
 'Eswatini\u200a[j]',
 'Falkland Islands (the) [Malvinas]\u200a[k]',
 'Faroe Islands (the)',
 'France\u200a[l]',
 'French Guiana',
 'French Southern Territories (the)\u200a[m]',
 'Gambia (the)',
 'Great Britain – See United Kingdom, The.',
 'Guadeloupe',
 'Heard Island and McDonald Islands',
 'Holy See (the)\u200a[n]',
 'Iran (Islamic Republic of)',
 "Ivory Coast – See Côte d'Ivoire.",
 'Jan Mayen – See Svalbard and Jan Mayen.',
 "Korea (the Democratic People's Republic of)\u200a[o]",
 'Korea (the Republic of)\u200a[p]',
 "Lao People's Democratic Republic (the)\u200a[q]",
 'Macao\u200a[r]',
 'Marshall Islands (the)',
 'Martinique',
 'Mayotte',
 'Micronesia (Federated States of)',
 'Moldova (the Republic of)',
 'Myanmar\u200a[t]',
 'Netherlands (the)',
 'Niger (the)',
 'Norfolk Island',
 "North Korea – See Korea, The Democratic People's Republic of.",
 'North Macedonia\u200a[s]',
 'Northern Mariana Islands (the)',
 'Palestine, State of',
 "People's Republic of China – See China.",
 'Philippines (the)',
 'Pitcairn\u200a[u]',
 'Republic of China – See Taiwan (Province of China).',
 'Republic of Korea – See Korea, The Republic of.',
 'Republic of the Congo – See Congo, The.',
 'Russian Federation (the)\u200a[v]',
 'Réunion',
 'Saba – See Bonaire, Sint Eustatius and Saba.',
 'Sahrawi Arab Democratic Republic – See Western Sahara.',
 'Saint Barthélemy',
 'Saint Helena\xa0Ascension Island\xa0Tristan da Cunha',
 'Saint Martin (French part)',
 'Sao Tome and Principe',
 'Sint Eustatius – See Bonaire, Sint Eustatius and Saba.',
 'Sint Maarten (Dutch part)',
 'South Georgia and the South Sandwich Islands',
 'South Korea – See Korea, The Republic of.',
 'Sudan (the)',
 'Svalbard\xa0Jan Mayen',
 'Syrian Arab Republic (the)\u200a[x]',
 'Taiwan (Province of China)\u200a[y]',
 'Tanzania, the United Republic of',
 'Timor-Leste\u200a[aa]',
 'Turks and Caicos Islands (the)',
 'Türkiye',
 'United Arab Emirates (the)',
 'United Kingdom of Great Britain and Northern Ireland (the)',
 'United States Minor Outlying Islands (the)\u200a[ac]',
 'United States Virgin Islands – See Virgin Islands (U.S.).',
 'United States of America (the)',
 'Vatican City – See Holy See, The.',
 'Venezuela (Bolivarian Republic of)',
 'Viet Nam\u200a[ae]',
 'Virgin Islands (British)\u200a[af]',
 'Virgin Islands (U.S.)\u200a[ag]',
 'Western Sahara\u200a[ah]',
 'Åland Islands'}


gdppc_wiki_names.difference(isocodes_names)

{'Australia',
 'Bahamas',
 'Bolivia',
 'British Virgin Islands',
 'Brunei',
 'Cape Verde',
 'Cayman Islands',
 'Central African Republic',
 'Comoros',
 'Congo',
 'Cook Islands',
 'Czech Republic',
 'DR Congo',
 'Dominican Republic',
 'East Timor',
 'Eswatini',
 'European Union',
 'Falkland Islands',
 'Faroe Islands',
 'France',
 'Gambia',
 'Iran',
 'Ivory Coast',
 'Kosovo',
 'Laos',
 'Macau',
 'Marshall Islands',
 'Micronesia',
 'Moldova',
 'Myanmar',
 'Netherlands',
 'Niger',
 'North Korea',
 'North Macedonia',
 'Northern Mariana Islands',
 'Palestine',
 'Philippines',
 'Russia',
 'Saint Helena, Ascension and Tristan da Cunha',
 'Saint Martin',
 'Sint Maarten',
 'South Korea',
 'Sudan',
 'Syria',
 'São Tomé and Príncipe',
 'Taiwan',
 'Tanzania',
 'Turkey',
 'Turks and Caicos Islands',
 'U.S. Virgin Islands',
 'United Arab Emirates',
 'United Kingdom',
 'United States',
 'Venezuela',
 'Vietnam',
 'World'}


# Set the size of the figure and get a figure and axis object
fig, ax = plt.subplots(figsize=(10,6))
merged.gdppc_CIA.plot.kde(ax=ax, label='CIA')
merged.gdppc_IMF.plot.kde(ax=ax, label='IMF')
merged.gdppc_WB.plot.kde(ax=ax, label='WB')
ax.legend()



# Set the size of the figure and get a figure and axis object
fig, ax = plt.subplots(figsize=(10,6))
merged.gdppc_CIA.plot.hist(ax=ax, label='CIA')
merged.gdppc_IMF.plot.hist(ax=ax, label='IMF', alpha=0.6)
merged.gdppc_WB.plot.hist(ax=ax, label='WB', alpha=0.3)
ax.legend()



# Set the size of the figure and get a figure and axis object
fig, ax = plt.subplots(figsize=(10,6))
merged.plot.scatter(x='gdppc_WB', y='gdppc_CIA', ax=ax, label='WB-CIA', c='r')
merged.plot.scatter(x='gdppc_WB', y='gdppc_IMF', ax=ax, label='WB-IMF', c='b')
ax.set_xlabel('World Bank')
ax.set_ylabel('Other Source')
ax.legend(loc='lower right')



countries = pd.Series(['Colombia', 'Turkey', 'United States', 'Germany', 'Chile'], name='country')
countries

0         Colombia
1           Turkey
2    United States
3          Germany
4            Chile
Name: country, dtype: object


print('\n', 'There are ', countries.shape[0], 'countries in this series.')

 There are  5 countries in this series.


countries.apply(len)

0     8
1     6
2    13
3     7
4     5
Name: country, dtype: int64


np.random.seed(123456)
data = pd.Series(np.random.normal(size=(countries.shape)), name='noise')
data

0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4    1.212112
Name: noise, dtype: float64


print('\n', 'The average in this sample is ', data.mean())
print('\n', 'The average in this sample is ', "{:.2f}".format(data.mean()))
print('\n', 'The maximum in this sample is ', "{:.2f}".format(data.max()))
print('\n', 'The standard deviation in this sample is ', "{:.2f}".format(data.std()))

 The average in this sample is  -0.24926597871826645

 The average in this sample is  -0.25

 The maximum in this sample is  1.21

 The standard deviation in this sample is  1.12


data.apply(np.exp)

0    1.598575
1    0.753623
2    0.221118
3    0.321219
4    3.360575
Name: noise, dtype: float64


df = pd.DataFrame([countries, data])
df


df = df.T
df


df = pd.concat([countries, data], axis=1)
df


df = pd.DataFrame({'country':countries,
                   'noise':data})
df


df['noise_sq'] = df.noise**2
df['noise and its square'] = df.noise + df.noise_sq
df['name length'] = df.country.apply(len)
df


south_america = ['Colombia', 'Chile']


df['South America Logical'] = df.country.apply(lambda x: x in south_america)
df


mydict = {True:1,
          False:0}
df['South America Dict'] = df['South America Logical'].map(mydict)
df


df['South America'] = df.country.apply(lambda x: x in south_america).astype(int)
df

	ISO 3166[1]	Unnamed: 1_level_0	Unnamed: 2_level_0	ISO 3166-1[2]			ISO 3166-2[3]	Unnamed: 7_level_0
	Country name[5]	Official state name[6]	Sovereignty[6][7][8]	Alpha-2 code[5]	Alpha-3 code[5]	Numeric code[5]	Subdivision code links[3]	Internet ccTLD[9]
0	Afghanistan	The Islamic Republic of Afghanistan	UN member state	.mw-parser-output .monospaced{font-family:mono...	AFG	004	ISO 3166-2:AF	.af
1	Åland Islands	Åland	Finland	AX	ALA	248	ISO 3166-2:AX	.ax
2	Albania	The Republic of Albania	UN member state	AL	ALB	008	ISO 3166-2:AL	.al
3	Algeria	The People's Democratic Republic of Algeria	UN member state	DZ	DZA	012	ISO 3166-2:DZ	.dz
4	American Samoa	The Territory of American Samoa	United States	AS	ASM	016	ISO 3166-2:AS	.as
...	...	...	...	...	...	...	...	...
266	Wallis and Futuna	The Territory of the Wallis and Futuna Islands	France	WF	WLF	876	ISO 3166-2:WF	.wf
267	Western Sahara [ah]	The Sahrawi Arab Democratic Republic	Disputed [ai]	EH	ESH	732	ISO 3166-2:EH	[aj]
268	Yemen	The Republic of Yemen	UN member state	YE	YEM	887	ISO 3166-2:YE	.ye
269	Zambia	The Republic of Zambia	UN member state	ZM	ZMB	894	ISO 3166-2:ZM	.zm
270	Zimbabwe	The Republic of Zimbabwe	UN member state	ZW	ZWE	716	ISO 3166-2:ZW	.zw

	Country name[5]	Official state name[6]	Sovereignty[6][7][8]	Alpha-2 code[5]	Alpha-3 code[5]	Numeric code[5]	Subdivision code links[3]	Internet ccTLD[9]
0	Afghanistan	The Islamic Republic of Afghanistan	UN member state	.mw-parser-output .monospaced{font-family:mono...	AFG	004	ISO 3166-2:AF	.af
1	Åland Islands	Åland	Finland	AX	ALA	248	ISO 3166-2:AX	.ax
2	Albania	The Republic of Albania	UN member state	AL	ALB	008	ISO 3166-2:AL	.al
3	Algeria	The People's Democratic Republic of Algeria	UN member state	DZ	DZA	012	ISO 3166-2:DZ	.dz
4	American Samoa	The Territory of American Samoa	United States	AS	ASM	016	ISO 3166-2:AS	.as

	Country name	Official state name	Sovereignty	Alpha-2 code	Alpha-3 code	Numeric code	Subdivision code links	Internet ccTLD
0	Afghanistan	The Islamic Republic of Afghanistan	UN member state	.mw-parser-output .monospaced{font-family:mono...	AFG	004	ISO 3166-2:AF	.af
1	Åland Islands	Åland	Finland	AX	ALA	248	ISO 3166-2:AX	.ax
2	Albania	The Republic of Albania	UN member state	AL	ALB	008	ISO 3166-2:AL	.al
3	Algeria	The People's Democratic Republic of Algeria	UN member state	DZ	DZA	012	ISO 3166-2:DZ	.dz
4	American Samoa	The Territory of American Samoa	United States	AS	ASM	016	ISO 3166-2:AS	.as

	Country name	Official state name	Sovereignty	Alpha-2 code	Alpha-3 code	Numeric code	Subdivision code links	Internet ccTLD	Alpha-2 code original
0	Afghanistan	The Islamic Republic of Afghanistan	UN member state	AF	AFG	004	ISO 3166-2:AF	.af	.mw-parser-output .monospaced{font-family:mono...
1	Åland Islands	Åland	Finland	AX	ALA	248	ISO 3166-2:AX	.ax	AX
2	Albania	The Republic of Albania	UN member state	AL	ALB	008	ISO 3166-2:AL	.al	AL
3	Algeria	The People's Democratic Republic of Algeria	UN member state	DZ	DZA	012	ISO 3166-2:DZ	.dz	DZ
4	American Samoa	The Territory of American Samoa	United States	AS	ASM	016	ISO 3166-2:AS	.as	AS

	Country/Territory	UN Region	IMF[5][6]		World Bank[7]		CIA[8]
	Country/Territory	UN Region	Estimate	Year	Estimate	Year	Estimate	Year
0	Monaco *	Europe	—	—	190513	2019	115700	2015
1	Liechtenstein *	Europe	—	—	180367	2018	139100	2009
2	Luxembourg *	Europe	140694	2022	118360	2020	110300	2020
3	Singapore *	Asia	131580	2022	98526	2020	93400	2020
4	Ireland *	Europe	124596	2022	93612	2020	89700	2020
...	...	...	...	...	...	...	...	...
225	Somalia *	Africa	1322	2022	875.2	2020	800	2020
226	DR Congo *	Africa	1316	2022	1131	2020	1098	2019
227	Central African Republic *	Africa	1102	2022	979.6	2020	945	2019
228	South Sudan *	Africa	928	2022	1235	2015	1600	2017
229	Burundi *	Africa	856	2022	771.2	2020	700	2020

	0	1	2	3	4
country	Colombia	Turkey	United States	Germany	Chile
noise	0.469112	-0.282863	-1.509059	-1.135632	1.212112

	country	noise	noise_sq	noise and its square	name length
0	Colombia	0.469112	0.220066	0.689179	8
1	Turkey	-0.282863	0.080012	-0.202852	6
2	United States	-1.509059	2.277258	0.768199	13
3	Germany	-1.135632	1.289661	0.154029	7
4	Chile	1.212112	1.469216	2.681328	5

Introduction to Data Analysis in Python

using pandas

What is pandas?

What is pandas for?

How to create a `DataFrame`?

Import and Export Data

Export Data
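
A minimal sketch of exporting a dataframe to CSV and reading it back. It writes to an in-memory buffer so that it runs anywhere; in practice you would pass a file path such as `'mydata.csv'` (a hypothetical name) instead of the buffer.

```python
import io
import pandas as pd

# Toy data (illustrative assumption)
df = pd.DataFrame({'country': ['Colombia', 'Chile'],
                   'gdppc': [1.5, 2.5]})

# Export: index=False avoids writing the row index as an extra column
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Import: rewind the buffer and read the same text back
buffer.seek(0)
df_back = pd.read_csv(buffer)

print(df_back)
```

pandas has matching pairs of readers and writers for other formats too, e.g. `pd.read_excel` / `DataFrame.to_excel` and `pd.read_stata` / `DataFrame.to_stata`.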

Create a New `DataFrame`

Select subset of data

Select subset of columns

Select subset of rows

Select subset of rows and columns

Plot Data

pandas uses `matplotlib`, so we can pass options or use axes and figures as we learned before

We can also pass pandas dataframes to seaborn, statsmodels, and many other packages

Create New Columns/Variables

From Other Columns/Variables

From a `list` or `numpy.array`

Applying Functions to Dataframe

Create Statistics for Columns in Dataframe

Apply Functions or Create Aggregate Statistics by Groups of Rows

Reshape DataFrame

Many times you may need to reshape your data

Reshape DataFrame From Wide To Long

New dataframe has unit $\times$ variable/year in each row and values in columns

Useful commands
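
Two pandas functions that are commonly used to go from wide to long are `pd.wide_to_long` and `DataFrame.melt`. The sketch below uses toy data with made-up column names; it is one possible illustration, not necessarily the commands the lecture has in mind.

```python
import pandas as pd

# Wide data: one row per country, one column per year (illustrative assumption)
wide = pd.DataFrame({'country': ['Colombia', 'Chile'],
                     'gdp2020': [1.0, 2.0],
                     'gdp2021': [1.5, 2.5]})

# wide_to_long splits column names into a stub ('gdp') and a numeric suffix (the year)
long = pd.wide_to_long(wide, stubnames='gdp', i='country', j='year').reset_index()

# melt keeps the original column names as values of a new 'variable' column
long_melt = wide.melt(id_vars='country', var_name='variable', value_name='value')

print(long)
print(long_melt)
```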

New dataframe has units (countries, households, individuals) in rows and variables or yearly observations in columns

Useful commands
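
To go from long to wide, `DataFrame.pivot` is a common choice. The sketch below uses toy data with made-up column names as an illustration.

```python
import pandas as pd

# Long data: one row per country and year (illustrative assumption)
long = pd.DataFrame({'country': ['Colombia', 'Colombia', 'Chile', 'Chile'],
                     'year': [2020, 2021, 2020, 2021],
                     'gdp': [1.0, 1.5, 2.0, 2.5]})

# One row per country, one column per year, values taken from 'gdp'
wide = long.pivot(index='country', columns='year', values='gdp')

print(wide)
```

`pivot` requires each (index, columns) pair to appear only once; when there are repeated pairs, `pivot_table` can aggregate them instead.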

Examples

Example - Import data

Let's import the table of countries' ISO codes from Wikipedia

Not perfect, but we can correct it and make it look nice

First, let's drop the first level of the column index

Second, let's correct column names

Third, let's correct `Alpha-2 code` using `Subdivision code links`

Now, let's import the table of countries' GDP per capita (PPP) from Wikipedia

Again we need to clean the data a little bit

Let's eliminate the * in the country names

Let's make sure years and GDPpc columns are treated as numbers

Let's try to merge the data from both dataframes

The only common information in both dataframes is the country's name, so let's merge using the corrected `country_name` and `Country name`

Which country names are not common to both dataframes?

Clearly, to create the full dataset, we'd need to standardize the country names.

This is a major reason to use ISO codes!

We'll learn methods to do this in another lecture; for now let's use the merged subset
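
As a rough preview, and not the method of the later lecture, the sketch below shows two things on toy data: how `indicator=True` reports which rows failed to match, and how a little string cleaning recovers some of them. The rows mimic the naming problems seen above; the column `name_clean` is a hypothetical helper.

```python
import pandas as pd

# Toy versions of the two tables (illustrative assumption)
iso = pd.DataFrame({'Country name': ['Chile', 'Bahamas (the)', 'France\u200a[l]'],
                    'Alpha-3 code': ['CHL', 'BHS', 'FRA']})
gdp = pd.DataFrame({'country_name': ['Chile', 'Bahamas', 'France', 'World'],
                    'gdppc_WB': [1.0, 2.0, 3.0, 4.0]})

# An outer merge with indicator=True adds a _merge column telling us
# whether each row was found in both tables or in only one of them
check = iso.merge(gdp, how='outer', left_on='Country name',
                  right_on='country_name', indicator=True)
print(check['_merge'].value_counts())

# Simple cleaning: replace the hair space, then drop footnote markers
# such as [l] and the "(the)" suffix
iso['name_clean'] = (iso['Country name']
                     .str.replace('\u200a', ' ', regex=False)
                     .str.replace(r'\s*\[.*?\]', '', regex=True)
                     .str.replace(r'\s*\(the\)', '', regex=True)
                     .str.strip())

merged_clean = iso.merge(gdp, left_on='name_clean', right_on='country_name')
print(merged_clean)
```

Cleaning like this only fixes formatting differences. Names that genuinely differ, such as Türkiye and Turkey, still need a mapping to a common code.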

Simple Plots

Example - Create Data

We can apply a function to the data using the `apply` method.

We can perform certain computations using some of the properties of the `pd.Series`:

We can transform the data using the `apply` method

Let's create a `pd.DataFrame` using these two series.

Method 1

Method 2

Method 3

Adding more variables/rows

First, let's identify some observations, e.g., all countries in South America.

Let's create a list of South American countries

Let's create a new dummy variable that identifies countries in South America

using `apply` and our `south_america` list

Notice the new column takes on logical values, i.e., `True` or `False`

More useful to have numerical values, where `1:True` and `0:False`

Method 1: Dictionary and `map`

Method 2: Change type

Exercises
