
In today’s data-driven media landscape, understanding how news and public discourse evolve across time and geography has become critical for researchers, journalists, and policymakers. Media Cloud, an open-source platform developed by the MIT Center for Civic Media and Harvard’s Berkman Klein Center, offers a powerful way to explore such questions through structured media data. Its API gives users direct programmatic access to millions of news stories collected from thousands of media sources around the world, enabling large-scale text analysis, topic tracking, and narrative comparison.
This blog introduces how to use the Media Cloud API effectively, from retrieving topic-specific story data and analyzing word frequencies to building visualizations that capture media attention patterns. Whether you are a data scientist studying agenda-setting or a policy researcher examining framing differences across regions, Media Cloud's API provides a transparent, reproducible pathway for analyzing media ecosystems at scale.
Get API Key
To begin, create an account on Media Cloud and request API access from your profile page. You will then receive an email containing your API key.

Before running the code that follows, import the libraries below and set MC_API_KEY to your own API key.
import os, mediacloud.api
import pandas as pd
import numpy as np
import json
from importlib.metadata import version
from dotenv import load_dotenv
import datetime as dt
from datetime import datetime
from IPython.display import JSON
import bokeh.io
import pycountry
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
import re
bokeh.io.reset_output()
bokeh.io.output_notebook()
MC_API_KEY = 'Your API Key'
search_api = mediacloud.api.SearchApi(MC_API_KEY)
f'Using Media Cloud python client v{version("mediacloud")}'
National Sources List
Fortunately, Media Cloud offers "National Collection" lists that cover domestic news coverage within individual countries. To retrieve data for each country, however, you need the ID of that country's national collection. I discovered a URL that contains comprehensive information on all sources within Media Cloud. I converted this JSON into a DataFrame and filtered it down to the national collections. Conveniently, national collections include the string 'National' in the name column.
## s = save the text from the URL
data = json.loads(s)
df = pd.DataFrame(data["results"])
df_nat = df[df['name'].str.contains('National', na=False)].reset_index(drop=True)
df_nat["country"] = df_nat["name"].str.replace(" - National", "", regex=False)
def get_iso3(name):
    try:
        return pycountry.countries.lookup(name).alpha_3
    except LookupError:
        return None
df_nat["ISO3"] = df_nat["country"].apply(get_iso3)
iso_map = {
"Turkey": "TUR",
"Kosovo": "XKX",
"Sri Lanka": "LKA",
"Macedonia, Republic of": "MKD",
"Saint Helena": "SHN"
}
df_nat.loc[df_nat['name']=='Sri Lanka- National', "country"] = 'Sri Lanka'
df_nat.loc[df_nat['ISO3'].isna(), "ISO3"] = df_nat["country"].map(iso_map)
df_nat = df_nat[df_nat["ISO3"].notna()]
df_nat.head(10)
print(df_nat.shape)
This produces the formatted data below. For the API query, the essential columns are id, country, and ISO3.

Key Features of Query Syntax
Media Cloud's query syntax supports a number of features that help users construct precise searches. Several of these are especially useful; I have found proximity search particularly valuable for my data collection (see the Query Guide).
- Wildcards
  - Single-character wildcard: e.g., `wom?n` matches "woman" or "women".
  - Multi-character wildcard: e.g., `wom*n` or `wom*` matches "woman", "women", and also "womxn".
- Hyphens & quoted phrases
  - Terms that include hyphens must be enclosed in quotes: e.g., `"two-spirit"`.
  - To search an exact phrase, use double quotes: e.g., `"unhoused person"`.
- Default logic / Boolean operators
  - Terms placed next to each other without explicit operators are implicitly ANDed (e.g., `monkey banana` = `monkey AND banana`).
  - Negation is supported: prefix a term with `NOT` or `-` to exclude it (e.g., `gaza NOT hamas`).
- Field-specific queries
  - You can specify the field to be searched, for example: `title:biden` or `article_title:biden`.
- Language filter
  - To restrict results to a specific language, include `language:en` (or the relevant ISO 639-1 code).
- Unsupported features
  - Regular expressions are not supported.
  - Fuzzy queries (e.g., `Biden~` to match "Bidan", "Bidden") are not supported.
- Proximity search
  - You can search for words within a certain distance of each other: e.g., `"woman killed"~10` finds cases where "woman" and "killed" are within 10 words of each other.
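To make these rules concrete, here is a small sketch that composes a few illustrative query strings; the search terms themselves are just examples, not part of any particular study.

```python
# Illustrative query strings, one per syntax feature described above.
queries = {
    "wildcard": "wom?n",                    # matches "woman" or "women"
    "phrase": '"unhoused person"',          # exact quoted phrase
    "boolean": "gaza NOT hamas",            # negation with NOT
    "field": "article_title:biden",         # field-specific search
    "language": "refugee AND language:en",  # restrict to English
    "proximity": '"woman killed"~10',       # terms within 10 words
}

for name, q in queries.items():
    print(f"{name:10s} -> {q}")
```

Because these are plain strings, they can be combined freely with AND/OR and parentheses before being passed to the API.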
Query Example
The following code shows an API query against a National Collection. The query searches for the phrases 'fiscal deficit', 'budget deficit', and 'public debt' using a proximity search of 10 words. Boolean operators such as AND and OR can be combined to suit your search.
## United Kingdom's National Collection
start_date = dt.date(2025, 9, 1)
end_date = dt.date(2025, 9,30)
sources = [34412476] ## UK National Collection ID
my_query = '"fiscal deficit"~10 OR "budget deficit"~10 OR "public debt"~10'
api_result = search_api.story_count_over_time(my_query, start_date, end_date, collection_ids=sources)
api_result[2]
The outcome is as follows: each record contains the date, total count, count, and ratio. The ratio, referred to as attention over time, represents media attention focused on specific keywords. It measures the proportion of total media coverage allocated to the query terms within a given time frame; in other words, it shows the relative amount of attention a topic has garnered compared to all other coverage.

Once the variable api_result is converted to a DataFrame, the data is structured as illustrated below. To build a panel dataset, loop over each country's ISO3 code and its corresponding National Collection ID, retrieve the data for each, and append all the results at the end.
df = pd.DataFrame(api_result)
df
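A minimal sketch of such a loop, assuming df_nat carries the id, country, and ISO3 columns built earlier and search_api is the client initialized above; build_panel is a hypothetical helper name, and the API may rate-limit you, so pausing between requests can be sensible:

```python
import pandas as pd

def build_panel(search_api, df_nat, query, start_date, end_date):
    """Loop over national collections and stack attention-over-time
    results into one long-format panel (one row per country-date)."""
    frames = []
    for _, row in df_nat.iterrows():
        # One API call per National Collection ID.
        result = search_api.story_count_over_time(
            query, start_date, end_date, collection_ids=[row["id"]]
        )
        part = pd.DataFrame(result)  # columns: date, total_count, count, ratio
        part["country"] = row["country"]
        part["ISO3"] = row["ISO3"]
        frames.append(part)
    return pd.concat(frames, ignore_index=True)

# panel = build_panel(search_api, df_nat, my_query, start_date, end_date)
```

The ISO3 column makes the resulting panel easy to merge with country-level datasets later on.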
As the data shows, Media Cloud provides not only the number of articles containing specific keywords but also the ratio, which captures attention to a topic over time. This lets researchers and analysts identify trends and shifts in public interest, making it a valuable tool for anyone studying the evolving landscape of media discourse. In particular, it provides daily datasets that can be aggregated into monthly, quarterly, or yearly series, so users can conduct longitudinal studies with ease.
These datasets can be instrumental in econometric analysis, supporting more informed conclusions based on historical trends. Moreover, the ability to break the data down by different time frames adds granularity, letting users pinpoint not only which topics are gaining traction but also when those shifts occur, providing a comprehensive picture of media dynamics.
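For example, the daily series can be aggregated to monthly frequency with pandas; the data below is illustrative, but the column names follow the API output shown earlier:

```python
import pandas as pd

# Hypothetical daily attention data in the shape returned by the API.
daily = pd.DataFrame({
    "date": pd.date_range("2025-09-01", periods=30, freq="D"),
    "count": range(30),
    "total_count": [100] * 30,
}).set_index("date")

# Sum counts to monthly totals, then recompute the ratio from the sums
# (averaging the daily ratios directly would weight all days equally).
monthly = daily.resample("MS").sum()
monthly["ratio"] = monthly["count"] / monthly["total_count"]
print(monthly)
```

The same pattern with `resample("QS")` or `resample("YS")` yields quarterly and yearly series.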