• In today’s data-driven media landscape, understanding how news and public discourse evolve across time and geography has become critical for researchers, journalists, and policymakers. Media Cloud, an open-source platform developed by the MIT Center for Civic Media and Harvard’s Berkman Klein Center, offers a powerful way to explore such questions through structured media data. Its API gives users direct programmatic access to millions of news stories collected from thousands of media sources around the world, enabling large-scale text analysis, topic tracking, and narrative comparison.

    This blog introduces how to use the Media Cloud API effectively, from retrieving topic-specific story data and analyzing word frequencies to building visualizations that capture media attention patterns. Whether you are a data scientist studying agenda-setting or a policy researcher examining framing differences across regions, Media Cloud’s API provides a transparent, reproducible pathway for analyzing media ecosystems at scale.

    Get API Key

    To begin, please create an account and submit a request for API access on your profile page at Media Cloud. Following your request, you will receive an email containing your API key.

    Before running the rest of the code, import the following libraries and set MC_API_KEY to your own API key.

    import os, mediacloud.api
    import pandas as pd
    import numpy as np
    import json
    from importlib.metadata import version
    from dotenv import load_dotenv
    import datetime as dt
    from datetime import datetime
    from IPython.display import JSON
    import bokeh.io
    import pycountry
    from bokeh.plotting import figure, show
    from bokeh.models import ColumnDataSource
    import re
    bokeh.io.reset_output()
    bokeh.io.output_notebook()
    MC_API_KEY = 'Your API Key'
    search_api = mediacloud.api.SearchApi(MC_API_KEY)
    f'Using Media Cloud python client v{version("mediacloud")}'

    National Sources List

    Fortunately, Media Cloud offers a “National Collection” list that covers domestic media within individual countries. However, to obtain data for a country, you need the ID of that country’s national collection. I discovered a URL that contains comprehensive information on all sources and collections within Media Cloud; I transformed this JSON into a data frame and refined it to retain only the national collections, which can be identified by the string ‘National’ in the name column.
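    Since the snippet below expects the raw JSON text in a variable s, one way to fetch it is with requests; COLLECTIONS_URL below is a purely hypothetical placeholder for the sources/collections URL mentioned above.

    import requests
    ## COLLECTIONS_URL is a hypothetical placeholder; substitute the Media Cloud sources URL described above
    COLLECTIONS_URL = 'https://example.org/mediacloud_collections.json'
    s = requests.get(COLLECTIONS_URL, timeout=30).text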

    ## s holds the raw JSON text downloaded from that URL
    data = json.loads(s)
    
    df = pd.DataFrame(data["results"])
    
    df_nat = df[df['name'].str.contains('National', na=False)].reset_index(drop=True)
    df_nat["country"] = df_nat["name"].str.replace(" - National", "", regex=False)
    
    def get_iso3(name):
        try:
            return pycountry.countries.lookup(name).alpha_3
        except LookupError:
            return None  
    df_nat["ISO3"] = df_nat["country"].apply(get_iso3)
    
    iso_map = {
        "Turkey": "TUR",
        "Kosovo": "XKX",
        "Sri Lanka": "LKA",
        "Macedonia, Republic of": "MKD",
        "Saint Helena": "SHN"
    }
    df_nat.loc[df_nat['name']=='Sri Lanka- National', "country"] = 'Sri Lanka'
    df_nat.loc[df_nat['ISO3'].isna(), "ISO3"] = df_nat["country"].map(iso_map)
    
    df_nat=df_nat[df_nat["ISO3"].notna()]
    df_nat.head(10)
    print(df_nat.shape)

    This produces the formatted data shown below. For the API query, you will need each collection’s id together with its country name and ISO3 code.

    Key Features of Query Syntax

    There are numerous query features that help users retrieve the relevant data. Several are especially useful; I have found proximity search particularly helpful for my data collection (see the Query Guide).

    • Wildcards
      • Single-character wildcard: e.g., wom?n matches “woman” or “women”.
      • Multi-character wildcard: e.g., wom*n or wom* matches “woman”, “women”, and also “womxn”.
    • Hyphens & quoted phrases
      • Terms that include hyphens must be enclosed in quotes: e.g., "two-spirit".
      • To search an exact phrase, use double quotes: e.g., "unhoused person".
    • Default logic / Boolean operators
      • Terms placed next to each other without explicit operators are implicitly ANDed (e.g., monkey banana = monkey AND banana).
      • Negation is supported: prefix a term with NOT or - to exclude it (e.g., gaza NOT hamas).
    • Field-specific queries
      • You can specify the field to be searched, for example: title: biden or article_title: biden.
    • Language filter
      • To restrict to a specific language, include language:en (or the relevant ISO 639-1 code).
    • Unsupported features
      • Regular expressions are not supported.
      • Fuzzy queries (e.g., Biden~ to match “Bidan”, “Bidden”) are not supported.
    • Proximity search
      • You can search for words within a certain distance of each other: e.g., "woman killed"~10 finds cases where “woman” and “killed” are within 10 words.
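    As a quick illustration (not from the original post), several of these features can be combined in a single query string and passed to the same API calls shown in the next section:

    ## Illustrative query combining proximity search, OR/AND, negation, and a language filter
    combined_query = '("fiscal deficit"~10 OR "public debt"~10) AND language:en AND NOT sports'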

    Query Example

    The following code shows the API query used to pull from the UK National Collection. The query combines the terms ‘fiscal deficit’, ‘budget deficit’, and ‘public debt’, each as a proximity search within 10 words. Boolean operators such as AND or OR may be used to suit your preferred search.

    ## United Kingdom's National Collection 
    start_date = dt.date(2025, 9, 1)
    end_date = dt.date(2025, 9,30)
    sources = [34412476] ## UK National Collection ID
    my_query = '"fiscal deficit"~10 OR "budget deficit"~10 OR "public debt"~10'
    api_result = search_api.story_count_over_time(my_query, start_date, end_date, collection_ids=sources)
    api_result[2]

    The outcome is as follows: for each day it returns the date, total count, count, and ratio. The ratio, referred to as attention over time, represents media attention focused on the query terms: it measures the proportion of total media coverage devoted to the query within the specified time frame. In other words, it illustrates the relative amount of attention a topic has garnered compared to all other coverage.

    Once the variable api_result is converted to a DataFrame, the structure of the data appears as illustrated below. To create a panel dataset, loop over the ISO3 codes and their corresponding National Collection IDs, appending all of the collected data at the end of the process; a minimal sketch of such a loop follows the preview below.

    df = pd.DataFrame(api_result)
    df
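    Following that approach, the sketch below loops over the national collections; it assumes the df_nat frame built earlier exposes id, country, and ISO3 columns and reuses my_query and the date range from the example above.

    ## Loop over the national collections to build a country-day panel (hedged sketch)
    frames = []
    for _, row in df_nat.iterrows():
        try:
            result = search_api.story_count_over_time(
                my_query, start_date, end_date, collection_ids=[row['id']]
            )
        except Exception as e:
            print(f"Skipping {row['country']}: {e}")  # skip collections that fail or return nothing
            continue
        tmp = pd.DataFrame(result)
        tmp['ISO3'] = row['ISO3']
        tmp['country'] = row['country']
        frames.append(tmp)

    panel = pd.concat(frames, ignore_index=True)
    print(panel.shape)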

    As the data shows, Media Cloud provides not only the number of articles containing specific keywords but also the ratio, which captures the attention a topic receives over time. This feature allows researchers and analysts to identify trends and shifts in public interest, making it an essential tool for anyone studying the evolving landscape of media discourse. In particular, it provides daily data that can be aggregated into monthly, quarterly, or yearly series, so longitudinal studies are straightforward.
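    For example, a daily panel like the one built above can be collapsed to monthly attention with a short groupby; the sketch below assumes a panel DataFrame with date, count, total_count, and ISO3 columns, as returned by the loop earlier.

    ## Aggregate daily counts to monthly attention ratios (hedged sketch)
    panel['date'] = pd.to_datetime(panel['date'])
    monthly = (panel
               .groupby(['ISO3', pd.Grouper(key='date', freq='MS')])[['count', 'total_count']]
               .sum()
               .reset_index())
    monthly['ratio'] = monthly['count'] / monthly['total_count']
    monthly.head()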

    These datasets can be instrumental when utilized in econometric analysis, allowing for more informed decision-making based on historical trends and projected outcomes. Furthermore, the ability to break down data by various time frames enhances the granularity of insights, enabling users to pinpoint not only what topics are gaining traction but also when these shifts occur, thereby providing a comprehensive understanding of media dynamics.

  • One of the most valuable datasets for macroeconomic analysis is the IMF World Economic Outlook. In this post, we’ll explore some code solutions to prepare and clean the WEO dataset, transforming it into a panel-ready structure that’s perfect for econometric studies or policy analysis. Stay tuned as we delve into the steps needed to get your data ready for rigorous analysis.

    First, the WEO dataset can be accessed via the following link. Download it and import it into Python.

    df = pd.read_excel('WEOApr2025all.xlsx')
    df = df[df['ISO'].notna()]   # drop rows without an ISO code (e.g., footer notes)
    df.rename(columns={'ISO':'ISO3'}, inplace=True)
    df.head()

    WEO Dictionary

    This block constructs a concise reference table mapping WEO subject codes to their descriptive metadata: subject descriptor, subject notes, units, and scale. It removes duplicates so that each subject code appears only once. Although saving to Excel is commented out, this dictionary is useful for documentation or for merging variable definitions back in later.

    # Dictionary of WEO subject codes and their metadata
    dict_weo = df[['WEO Subject Code','Subject Descriptor','Subject Notes','Units','Scale']].drop_duplicates().reset_index(drop=True)
    #dict_weo.to_excel('weo_dictionary.xlsx',index=False)
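    For instance, the dictionary can later be queried to check what a given subject code means (NGDP_RPCH, real GDP growth, is used here as an example):

    ## Look up the metadata for one WEO subject code
    dict_weo[dict_weo['WEO Subject Code'] == 'NGDP_RPCH']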

    Dropping Redundant Columns

    In this step, extraneous descriptive columns are dropped from the primary dataset. Although these fields are informative, they are not required for quantitative analysis and only increase file size. The data is then restructured with the melt() and pivot() functions, making it suitable for panel analysis.

    df.drop(columns=['Subject Descriptor','Subject Notes','Units','Scale','Country/Series-specific Notes','Estimates Start After'],inplace=True)
    panel_df = pd.melt(df, id_vars=['WEO Country Code','ISO3', 'Country', 'WEO Subject Code'], 
                       var_name='Year', value_name='Value')
    
    panel_df['Year'] = panel_df['Year'].astype(int)
    panel_df['Value'] = pd.to_numeric(panel_df['Value'], errors='coerce')
    pivot_df = panel_df.pivot(index=['WEO Country Code', 'ISO3', 'Country', 'Year'], 
                              columns='WEO Subject Code', 
                              values='Value').reset_index()
    
    pivot_df = pivot_df[['WEO Country Code', 'ISO3', 'Country', 'Year',
                         'NGDP_R', 'NGDP_RPCH', 'NGDP', 'NGDPD', 'PPPGDP', 'NGDP_D',
                        'NGDPRPC', 'NGDPRPPPPC', 'NGDPPC', 'NGDPDPC', 'PPPPC',
                        'NGAP_NPGDP', 'PPPSH', 'PPPEX', 'NID_NGDP', 'NGSD_NGDP', 'PCPI',
                        'PCPIPCH', 'PCPIE', 'PCPIEPCH', 'TM_RPCH', 'TMG_RPCH', 'TX_RPCH',
                        'TXG_RPCH', 'LUR', 'LE', 'LP', 'GGR', 'GGR_NGDP', 'GGX',
                        'GGX_NGDP', 'GGXCNL', 'GGXCNL_NGDP', 'GGSB', 'GGSB_NPGDP',
                        'GGXONLB', 'GGXONLB_NGDP', 'GGXWDN', 'GGXWDN_NGDP', 'GGXWDG',
                        'GGXWDG_NGDP', 'NGDP_FY', 'BCA', 'BCA_NGDPD']] 
    pivot_df.sort_values(by=['ISO3', 'Year'], inplace=True)
    pivot_df.reset_index(drop=True, inplace=True)
    pivot_df.columns.name = None
    pivot_df
    

    Creating New Macroeconomic Variables

    The final section enriches the dataset by computing key analytical indicators used in macroeconomic research. I added variables commonly used in macroeconomic analysis, such as the GDP growth rate, an economic-crisis dummy, GDP per capita, and the inflation rate. Note that growth rates must be computed within each country, so the lags below are taken with a groupby on ISO3.

    ## Growth rates are computed within each country: the groupby on ISO3 prevents
    ## the first year of one country from borrowing the previous country's last value
    pivot_df['GDP_N_GR'] = pivot_df['NGDP'] / pivot_df.groupby('ISO3')['NGDP'].shift(1) - 1
    pivot_df['GDP_R_GR'] = pivot_df['NGDP_R'] / pivot_df.groupby('ISO3')['NGDP_R'].shift(1) - 1
    
    ## Crisis dummy: 1 if real GDP growth is negative
    pivot_df['GDP_CRISIS'] = (pivot_df['GDP_R_GR'] < 0).astype(int)
    
    ## Per-capita levels: WEO GDP is in billions, population (LP) in millions
    pivot_df['GDP_R_PC'] = (pivot_df['NGDP_R'] * 1e9) / (pivot_df['LP'] * 1e6)
    pivot_df['GDP_R_PC_GR'] = pivot_df['GDP_R_PC'] / pivot_df.groupby('ISO3')['GDP_R_PC'].shift(1) - 1
    pivot_df['GDP_PPP_PC'] = (pivot_df['PPPGDP'] * 1e9) / (pivot_df['LP'] * 1e6)
    
    ## Inflation from average (PCPI) and end-of-period (PCPIE) price levels
    pivot_df['INFLATION'] = pivot_df['PCPI'] / pivot_df.groupby('ISO3')['PCPI'].shift(1) - 1
    pivot_df['INFLATION_END'] = pivot_df['PCPIE'] / pivot_df.groupby('ISO3')['PCPIE'].shift(1) - 1
    
    pivot_df.head()
    

    A fully automated pipeline like this not only improves efficiency but also reinforces research transparency and reproducibility. By embedding clear data definitions, consistent transformations, and computed indicators in code rather than manual spreadsheets, it ensures that every analytical step can be replicated, audited, and extended. This approach transforms the IMF WEO dataset from a static source into a living analytical foundation that can be continuously updated and reused for future macroeconomic and fiscal studies.

  • About a year ago, I decided to dive into learning spatial analysis software, specifically ArcGIS. While I was practicing, I took a look at demographic data for Washington, D.C. In this post, I’m going to share what I found out about how the population is spread across D.C. and where different racial and ethnic groups tend to cluster.

    Overall Population Distribution

    The population data comes from the U.S. Census (via the open data portal for the District of Columbia). The dataset spans the years 2017-2021 (the ACS Demographic Characteristics for D.C. census tracts).

    Washington, D.C.’s total population is 683,154. Within the map shown below, the census-tract maximum is 7,138 people and the minimum is 17 people. Excluding non-residential areas such as the National Mall and areas near the White House, the population appears to be relatively evenly distributed across the city.

    Top Four Racial/Ethnic Groups (Black/African American, White, Hispanic/Latino, Asian)

    The totals for the four major groups do not exactly match the overall population — likely due to overlapping responses or other survey issues. The four groups shown here are the largest.

    • Black/African American: 44.7% (305,109)
    • White: 40.5% (276,373)
    • Hispanic or Latino: 11.3% (76,982)
    • Asian: 4.1% (27,989)

    Black/African American vs. White Distribution

    After the elimination of discriminatory housing policies in the 1960s, many U.S. cities, including D.C., began to exhibit patterns of residential self-segregation. In the map shown, green represents the White population share and purple the Black/African American share. White populations are concentrated in the central and western parts of D.C., whereas Black/African American populations are dense in the eastern and southern parts.

    Hispanic or Latino

    I learned for the first time that the terms “Hispanic” and “Latino” are not the same: broadly, “Hispanic” refers to people from Spanish-speaking countries, and “Latino” refers to people from Latin America or the Caribbean. The map shows Hispanics/Latinos concentrated in the northern parts of Washington, D.C.

    Asian

    Asians represent a small share (4.1%) of the total population, but I visualized them nonetheless. The map shows concentrations in several downtown areas near Chinatown and around George Washington University, as well as some western parts. Note: a strong red hotspot does not imply that this group is large in absolute numbers compared to others; it only shows relative density compared to surrounding areas.

    Hot-Spot (Local Moran’s I) Analysis by Race/Ethnicity

    Hot-spot analysis is a technique for identifying clusters of high or low values in space, here by computing the Local Moran’s I statistic (a measure of spatial autocorrelation).

    In simple terms: on the map, red areas indicate statistically significant clustering of high values (i.e., high concentration), and blue areas indicate clusters of low values (i.e., relative scarcity). White indicates areas that are not statistically significant—perhaps because neighboring regions have mixed values, making interpretations conservative.
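    Although the maps here were produced in ArcGIS, the same Local Moran’s I computation can be reproduced in Python; the sketch below uses geopandas, libpysal, and esda with hypothetical file and column names, since those packages are not part of the original ArcGIS workflow.

    import geopandas as gpd
    from libpysal.weights import Queen
    from esda.moran import Moran_Local
    
    tracts = gpd.read_file('dc_census_tracts.geojson')   # hypothetical tract file
    y = tracts['black_pop'].values                        # hypothetical population column
    
    w = Queen.from_dataframe(tracts)   # queen-contiguity spatial weights
    w.transform = 'r'                  # row-standardize the weights
    
    lisa = Moran_Local(y, w, permutations=999)            # permutation-based inference
    tracts['cluster'] = lisa.q                            # quadrant: 1=HH, 2=LH, 3=LL, 4=HL
    tracts['p_value'] = lisa.p_sim                        # pseudo p-values
    tracts['significant'] = tracts['p_value'] < 0.05      # mirrors the red/blue vs. white map classes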

    Black/African American Hotspots: Eastern and southern D.C. show high clustering of Black/African American populations; the central and western parts show low concentrations relative to the average.

    White Hotspots: White populations cluster in central and western D.C., including the Capitol Hill area.

    Hispanic or Latino Hotspots: Hispanics/Latinos show clustering in the northern parts of D.C.

    Asian Hotspots: Asians are clustered somewhat in the downtown core and north-west areas; again, the absolute numbers are small so one must interpret cautiously.

    Summary

    Although the overall population of Washington, D.C. is relatively evenly distributed across census tracts, residential patterns by race/ethnicity show stark segmentation. Whites are clustered in the center and west, Black/African Americans in the east and south, Hispanics/Latinos in the north, and Asians in some central and western areas. The hot-spot analysis confirms that these patterns are statistically significant.

  • You may download the complete list of countries from the link below.

    When conducting visualizations or analyses utilizing country panel data, the initial step involves merging data from various sources. Nonetheless, instances may arise where these sources present differing standardized names for countries. To address this challenge, this post focuses on the creation of a comprehensive list of countries alongside their ISO2, ISO3, and corresponding continents. The ISO2 and ISO3 codes represent standardized abbreviations for countries as designated by the ISO (International Organization for Standardization). In this context, a list will be compiled using data from the World Bank (WB), International Monetary Fund (IMF), and United Nations (UN).

    World Bank Country List

    The following code is from a previous post on the World Bank API. Here, I retrieved a population dataset and kept only the ISO2, ISO3, country name, and an additional variable indicating that the observation is sourced from the World Bank. The total number of countries is 217.

    import requests
    import pandas as pd
    
    def wbcall(ind):
        url = f'http://api.worldbank.org/v2/country/all/indicator/{ind}?&per_page=30000&format=json'
        response = requests.get(url)
        test = response.json()
    
        if response.status_code != 200 or len(test) < 2:
            print(f"Warning: No data for indicator {ind}")
            return pd.DataFrame()
    
        a = pd.DataFrame(test[1])
        a['iso2'] = a['country'].apply(lambda x: x.get('id') if isinstance(x, dict) else None)
        a['country_name'] = a['country'].apply(lambda x: x.get('value') if isinstance(x, dict) else None)
    
        a = a.drop(['indicator', 'country', 'unit', 'obs_status', 'decimal'], axis=1)
        a = a.rename(columns={'countryiso3code': 'iso3', 'value': ind})
        return a
    
    wb_list = wbcall('SP.POP.TOTL')
    wb_list = wb_list[['iso2','iso3', 'country_name']].drop_duplicates().reset_index(drop=True)
    wb_list['WB']=1
    wb_list = wb_list[49:].reset_index(drop=True) # drop continent or other country groups
    wb_list.rename(columns={'iso2':'ISO2', 'iso3':'ISO3', 'country_name':'Country_WB'}, inplace=True)
    print(wb_list.shape)
    wb_list.head()

    IMF Country List

    The IMF country list comes from the World Economic Outlook (WEO), which can be accessed through the following link: https://data.imf.org/en/datasets/IMF.RES:WEO. The dataset includes ISO3 codes, country names, and an additional variable indicating that the observation is sourced from the IMF. In my experience, many sources use XKX and PSE as the ISO3 codes for Kosovo and the West Bank and Gaza, respectively. Consequently, to merge the country lists later in this code, I adjusted the ISO3 codes for these two countries accordingly. The total number of countries included in the dataset is 196.

    imf_geo = pd.read_excel(parent_dir / 'IMF/IMF_WEO/(2025.10.05)weo_cleaned.xlsx')
    imf_geo=imf_geo[['ISO3','Country']].drop_duplicates().reset_index(drop=True)
    imf_geo.rename(columns={ 'Country':'Country_IMF'}, inplace=True)
    imf_geo['IMF']=1
    imf_geo.loc[imf_geo['Country_IMF'] == 'West Bank and Gaza', 'ISO3'] = 'PSE'
    imf_geo.loc[imf_geo['Country_IMF'] == 'Kosovo', 'ISO3'] = 'XKX'
    imf_geo.head()

    UN Country List

    The UN source for the country list can be found at the following link (https://geoportal.un.org/arcgis/apps/sites/#/geohub/datasets/21ba52fde4bf4a6989050d55c2fe967d/about). This resource provides a comprehensive list of countries along with their corresponding ISO3 codes and subregions. The total number of countries included in this list is 238. It is important to note that, within the context of the UN, the list encompasses not only sovereign states but also various territories of countries.

    un_geo = pd.read_csv(file_path / 'UN_list.csv')
    un_geo = un_geo[['iso3cd', 'nam_en', 'subreg', 'intreg']].drop_duplicates().reset_index(drop=True)
    un_geo.rename(columns={'iso3cd':'ISO3', 'nam_en':'Country_UN', 'subreg':'Subregion', 'intreg':'Region'}, inplace=True)
    un_geo['UN']=1
    un_geo = un_geo[un_geo['Country_UN'].notnull() ]
    un_geo = un_geo[un_geo['Subregion'].notnull()]
    un_geo = un_geo[~un_geo['Country_UN'].isin(['Jammu and Kashmir', 'Gaza','Sint Eustatius', 'Saba', 'Galápagos Islands', 'Canary Islands','Chagos Archipelago', 'Madeira Island', 'Azores Islands', 'Saint Helena'])].reset_index(drop=True)
    un_geo.head()

    Merge all three sources

    The following code uses the pycountry_convert package to assign the continent that each country belongs to, based on its ISO3 code (converted internally to ISO2).

    ## Assign continents
    import pycountry_convert as pc
    def get_continent_from_iso3(iso3):
        try:
            iso2 = pc.country_alpha3_to_country_alpha2(iso3)
            continent_code = pc.country_alpha2_to_continent_code(iso2)
            continent_map = {
                "AF": "Africa",
                "NA": "North America",
                "OC": "Oceania",
                "AN": "Antarctica",
                "AS": "Asia",
                "EU": "Europe",
                "SA": "South America"
            }
            return continent_map[continent_code]
        except Exception:
            return None
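    A quick check of the function (illustrative):

    get_continent_from_iso3('KOR')   # 'Asia'
    get_continent_from_iso3('XKX')   # None; Kosovo is not in pycountry_convert and is handled below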

    Using the function above, the following code consolidates all three sources and fills in a few missing values, such as those for Taiwan and Kosovo. Additionally, I excluded several territories, including American Samoa and Bermuda, because they are not sovereign states.

    ## Merge all sources
    geo = pd.merge(wb_list, imf_geo, on='ISO3', how='outer')
    geo = pd.merge(geo, un_geo, on='ISO3', how='outer')
    geo = geo[geo['Country_WB'].notna() | geo['Country_IMF'].notna()].reset_index(drop=True)
    geo.loc[geo['Country_WB'] == 'Kosovo', 'Subregion'] = 'Eastern Europe'
    geo.loc[geo['ISO3'] == 'TWN', 'ISO2'] = 'TW'
    geo['country_name'] = geo['Country_WB']
    geo.loc[geo['ISO3'] == 'TWN', 'country_name'] = 'Taiwan'
    
    
    geo["Continent"] = geo["ISO3"].apply(get_continent_from_iso3)
    geo.loc[geo['Country_WB'] == 'Kosovo', 'Continent'] = 'Europe'
    geo.loc[geo['Country_WB'] == 'Timor-Leste', 'Continent'] = 'Asia'
    geo = geo[geo['Continent'].notnull()].reset_index(drop=True)
    
    #dropped countries that are not considered countries
    geo = geo[~geo['Country_WB'].isin(["American Samoa", "Bermuda", "Curaçao", "Cayman Islands", "Faroe Islands", "Gibraltar", "Greenland", "Guam", "Isle of Man", "Saint Martin (French part)", "Northern Mariana Islands", "New Caledonia", "French Polynesia", "Turks and Caicos Islands", "British Virgin Islands", "United States Virgin Islands"])].reset_index(drop=True)
    
    geo=geo[['ISO2', 'ISO3', 'country_name','Country_WB', 'Country_IMF', 'Country_UN', 'Continent', 'Subregion', 'Region', 'WB', 'IMF', 'UN']]
    geo[35:45]

    The final list contains 203 countries. When merging data against this list, you can use either ISO2 or ISO3 as the key column to keep your datasets consistent and accurate, and to easily identify and retain the countries you are interested in, whether they are key trading partners, focus regions for development projects, or areas of research. Adhering to these international standards streamlines your data processing and improves the quality of your analyses.
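    For example, any country-level dataset that carries an ISO3 column can be attached to the master list with a single merge (other_df below is a hypothetical placeholder for your own data):

    ## Attach an external dataset to the master list via ISO3, keeping only the 203 retained countries
    merged = geo.merge(other_df, on='ISO3', how='left')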

  • This post reviews the paper “Is the Smoking Reduction Effect of the Tobacco Consumption Tax Temporary?” (https://www.dbpia.co.kr/Journal/articleDetail?nodeId=NODE11044324). The study looks at the short-term and long-term effects of South Korea’s tobacco tax increase on January 1, 2015, which raised cigarette prices by 80%. While research shows that tobacco taxes reduce smoking rates in the short term, few studies have examined the lasting effects.

    Data and Variables

    The analysis uses data from the Korea Welfare Panel Study conducted by the Korea Institute for Health and Social Affairs (KIHASA), covering 7,072 households in 446 survey districts nationwide from 2008 to 2019. Key variables include:

    Table 1. Variables and Measurements

    Type        | Variables            | Measurement
    Dependent   | Smoking Amount       | average smoking amount for a day
    Independent | Policy               | introduction of tax: before 2015 = 0, 2015 and after = 1
    Independent | Year                 | slope if there was no tax: coded as 2008 = 1, 2009 = 2, ..., 2019 = 12
    Independent | Policy*Year          | slope change after the tax
    Control     | Sex                  | male = 1, female = 2
    Control     | Education            | elementary = 0, middle = 1, high = 2, college (2y) = 3, college (4y) = 4
    Control     | Region               | Seoul = 1, mega city = 2, city = 3, gun = 4, mixed-gun = 5
    Control     | Income               | disposable income / CPI (2015 = 100)
    Control     | Health               | good = 1, bad = 2
    Control     | Drinking             | less than once a month = 1, 2~4 times a month = 2, 2~3 times a week = 3, more than 3 times a week = 4, none = 5
    Control     | Spouse               | no = 0, yes = 1
    Control     | Satisfaction in Life | 1 = very dissatisfied, 2 = dissatisfied, 3 = neutral, 4 = satisfied, 5 = very satisfied

    Methodology

    To estimate the causal effects, the researchers applied a one-way fixed-effects panel model with an interrupted time-series design. This approach allows comparison between two distinct time periods which are before and after the policy intervention while controlling for unobserved individual characteristics. The fixed-effects model was selected over OLS and random-effects models based on both the F-test and Hausman test, which confirmed the presence of significant individual-specific effects.

    To estimate the short-term and long-term effects of the tobacco consumption tax, the authors used a one-way fixed-effects panel analysis with an interrupted time-series design. A nationwide policy such as South Korea’s tobacco consumption tax is a classic subject for this design, whose purpose is to evaluate differences in outcomes between the two segments of the time series; here, the two conditions are before and after the tax.

    The fixed-effects model allows Cov(X, αi) ≠ 0; when such data are analyzed with traditional OLS, the estimates are inconsistent because of this endogeneity problem. The fixed-effects model removes αi and thereby controls for time-invariant individual characteristics. To estimate the long-term effect of the tax, Year_t and Policy_t*Year_t are included. βy is the slope of the smoking amount if there had been no tobacco consumption tax, while βpy measures the change in slope after the tax; thus, βy + βpy is the slope after the tax.
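    Putting these definitions together, the estimating equation can be written as follows (a reconstruction from the definitions above, using the same coefficient names as the text):

    Smoking_it = β0 + βp·Policy_t + βy·Year_t + βpy·(Policy_t × Year_t) + γ·X_it + αi + εit

    where X_it is the vector of control variables, αi is the individual fixed effect, and εit is the error term.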

    Results

    The F-test and the Hausman test of the panel analysis were statistically significant at the 1% level, rejecting, respectively, the hypotheses that the fixed effects are zero and that Cov(X, αi) = 0; analyzed with traditional OLS, the results would therefore not be consistent because of the endogeneity problem. Table 2 presents the estimates by income division. In all income divisions, βp, the short-term effect, was statistically significant at the 1% level; on average the tax reduced smoking by 3.39 cigarettes per day. Both the slope without the tax (βy) and the slope change after the tax (βpy) were also statistically significant, meaning a long-term effect exists. The slope after the tax (βy + βpy) is -0.12 on average, so the amount of smoking decreases by 0.12 cigarettes annually.

    Table 2. Estimates by Income

                               | All        | Low Income | Middle Income | High Income
    Year_t (βy)                | -0.2992*** | -0.3940*** | -0.2133***    | -0.3005***
    Policy_t (βp)              | -3.3883*** | -3.5580*** | -3.3570***    | -3.2240***
    Policy_t*Year_t (βpy)      | 0.1837***  | 0.2099**   | 0.1477*       | 0.1880**
    Slope after tax (βy + βpy) | -0.1154    | -0.1841    | -0.0657       | -0.1124
    *** p<0.01, ** p<0.05, * p<0.1

    With these results, we can conclude that the tobacco consumption tax has both short-term and long-term effects; thus, it was an effective policy tool for reducing smoking. However, if the result is compared with the counterfactual, the evaluation may yield a different conclusion. Figure 1 illustrates the average smoking trend without the tax (red) and with the tax (blue). The blue line shows a short-term effect in 2015, where the amount decreased by 3.39, followed by a long-term effect of an annual reduction of 0.12. However, the counterfactual (red) shows that the reduction trend had a slope of -0.30 and would have reached a comparable smoking amount by 2032, even without the tax.

    Counterfactual Comparison

    However, when compared with the counterfactual scenario (the trend projected without the tax increase), the long-term effectiveness becomes less clear. The simulated “no-tax” trend suggests that smoking rates would have declined at a faster rate (–0.30 per year) even without the policy, converging with the actual post-tax trend around 2032. Hence, while the tax successfully triggered an immediate reduction, its relative long-term impact may diminish over time.

    Policy Implications and Limitations

    The results imply that tobacco taxation is effective in reducing smoking in both the short and medium term, but its impact could weaken in the long run unless complemented by continuous policy interventions. Given that tobacco consumption is highly price-sensitive, indexing the tax rate to inflation could help sustain its deterrent effect. Furthermore, analysis across income groups reveals evidence of tax regressivity. In other words, lower-income smokers are more price-responsive yet spend a larger share of income on tobacco, highlighting the need for non-price measures to mitigate disproportionate burdens on low-income households.

  • The purpose of this post is to introduce the World Bank (WB) API, a powerful tool for developers and data analysts that provides access to a vast array of socio-economic indicators. In this guide, we will walk through the steps required to get started with the API, from searching the extensive indicator list to calling and retrieving a complete dataset. Whether you are a beginner or an experienced user, this post aims to equip you with the knowledge needed to use the WB API efficiently and make informed decisions based on the data obtained.

    1. Exploring the WB API: A Comprehensive Guide to World Development Indicators

    The endpoint “api.worldbank.org/v2/indicator” returns the comprehensive list of indicators provided by the World Bank, a total of 29,201 distinct metrics covering many aspects of global economic and social dynamics, which makes them invaluable for research and policy-making. The important point is that we need an indicator’s specific ID to access the complete dataset associated with it.

    ## Indicator List ##
    import requests
    import pandas as pd
    
    ind_list = requests.get('http://api.worldbank.org/v2/indicator/all?&per_page=30000&format=json').json()
    ind_list = pd.DataFrame(ind_list[1])
    print(ind_list.shape) # 29201
    ind_list[['id','name','sourceNote','sourceOrganization']].head()

    Search in Indicator List

    However, because there are so many indicators, it can be difficult to find the ID you are interested in. To simplify this process, the following code helps you search the indicator list more efficiently; simply change the keyword to suit your needs and narrow the results down to the most relevant indicators for your project or analysis.

    ## search in indicator list ##
    keyword = "population"
    ind_list[['id', 'name']][ind_list['name'].str.contains(keyword, case=False, na=False)]  # case-insensitive, ignore missing names

    2. Accessing Comprehensive Data: Fetching the Complete Dataset from the WB API

    1) basic format

    We can restrict the range of years and countries, but I prefer to download the whole dataset because I usually use it for panel analysis, which provides a comprehensive view of trends over time and of how variables interact. The following code therefore downloads the entire dataset for an indicator; here it calls the population series for all available years and countries. Without further cleaning and organization, however, this raw output is not well suited for in-depth analysis, since inconsistencies can skew results or lead to misinterpretation, so some preprocessing is needed before diving into detailed analyses.

    ## Call the whole dataset for an indicator
    def wbcall(ind):
        url = f'http://api.worldbank.org/v2/country/all/indicator/{ind}?&per_page=30000&format=json'
        response = requests.get(url)
        test = response.json()
        a = pd.DataFrame(test[1])
        return a
        
    wbcall('SP.POP.TOTL').head()

    2) cleaned version

    The following code cleans the initial dataset so that all entries are consistent and ready for analysis. As you can see from the result, the WB data includes regional aggregates as well, which group countries into meaningful categories and can help reveal trends and patterns that are not visible at the country level.

    def wbcall(ind):
        url = f'http://api.worldbank.org/v2/country/all/indicator/{ind}?&per_page=30000&format=json'
        response = requests.get(url)
        test = response.json()
        a = pd.DataFrame(test[1])
        a['iso2'] = a['country'].apply(lambda x: x.get('id') if isinstance(x, dict) else None)
        a['country_name'] = a['country'].apply(lambda x: x.get('value') if isinstance(x, dict) else None)
    
        a = a.drop(['indicator', 'country', 'unit', 'obs_status', 'decimal'], axis=1)
        a = a.rename(columns={'countryiso3code': 'iso3', 'value': ind})
        a = a[['country_name','iso2', 'iso3', 'date', ind]]
        return a
    print(wbcall('SP.POP.TOTL').shape)
    wbcall('SP.POP.TOTL').head()

    3. Additional Considerations: Best Practices for Using the WB API

    When working with the WB API, keep the following best practices in mind to ensure efficient and effective data management:

    • API Request Limitations: Be aware of any rate limits imposed by the API. Avoid overwhelming the server with rapid successive requests that can result in being blocked.
    • Data Update Frequency: Understand how often the data is updated and reflect this in your analyses or reports to maintain the relevance of your findings.
    • Error Handling: Implement error handling in your API requests to manage cases where the API may be temporarily unavailable. Use try-except blocks to handle exceptions gracefully and provide feedback (a sketch follows this list).
    • Documentation and Community: Lastly, regularly consult the World Bank API documentation and engage with the community forums for support and additional resources. This can be immensely helpful for troubleshooting and discovering new features or datasets.
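    As referenced in the error-handling point above, the sketch below is a more defensive variant of the wbcall() function defined earlier, with basic retry logic and fallbacks; the function name and retry settings are illustrative assumptions.

    import time
    import requests
    
    def wbcall_safe(ind, retries=3):
        ## Hedged sketch: retry transient failures and fall back to an empty DataFrame
        url = f'http://api.worldbank.org/v2/country/all/indicator/{ind}?&per_page=30000&format=json'
        for attempt in range(retries):
            try:
                response = requests.get(url, timeout=30)
                response.raise_for_status()              # raise on HTTP 4xx/5xx errors
                payload = response.json()
                if len(payload) < 2 or payload[1] is None:
                    print(f"Warning: no data returned for indicator {ind}")
                    return pd.DataFrame()
                return pd.DataFrame(payload[1])
            except (requests.RequestException, ValueError) as e:
                print(f"Attempt {attempt + 1} failed for {ind}: {e}")
                time.sleep(2)                            # brief pause before retrying
        return pd.DataFrame()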

    By adhering to these best practices, you can maximize the benefits of the WB API, ultimately enhancing your data analysis capabilities and providing more insightful policy analyses.

    4. Conclusion

    In conclusion, the WB API serves as an invaluable resource for accessing a wealth of development indicators essential for research and analysis. With the right approaches to navigating the data, filtering through the indicators, fetching datasets, and visualizing the results, you can transform raw data into valuable insights that inform decisions and strategies. Whether you are diving into individual indicators or analyzing comprehensive datasets, the tools and techniques introduced in this guide will support you in making the most of the World Bank’s extensive data offerings.