NOTE: This repo is for usability study purposes only. The main Entity Linking repo is here: https://github.com/Wesleyan-Media-Project/entity_linking_2022.
Welcome! This repo contains scripts for identifying and linking election candidates and other political entities in political ads on Google and Facebook. The scripts provided here are intended to help journalists, academic researchers, and others interested in the democratic process to understand which political entities are connected and how.
This repo is part of the Cross-platform Election Advertising Transparency Initiative (CREATIVE). CREATIVE is an academic research project that aims to give the public analysis tools for greater transparency of political ads across online platforms. In particular, CREATIVE provides cross-platform integration and standardization of political ads collected from Google and Facebook. CREATIVE is a joint project of the Wesleyan Media Project (WMP) and the privacy-tech-lab at Wesleyan University.

To analyze the different dimensions of political ad transparency, we have developed an analysis pipeline. The scripts in this repo are part of the Data Classification step in our pipeline.

## Table of Contents
1. Video Tutorial
2. Overview
3. How to Run the Scripts
- 0. Cloning This Repository
- 1. Constructing Knowledge Base
- 2. Training Entity Linker
- 3. Making Inferences
4. Results Storage
5. Results Analysis
6. Thank You
## 1. Video Tutorial

Entity_Linking_Tutorial_Draft1.mp4
If you are unable to see the video above (e.g., you are getting the error "No video with supported format and MIME type found"), try a different browser. The video works on Google Chrome.
Alternatively, you can watch this tutorial on YouTube.
## 2. Overview

This repo contains an entity linker for 2022 election data. The entity linker is a machine learning classifier trained on data that contains descriptions of people, their names, and their aliases. The data are sourced from two comprehensive 2022 WMP files, person_2022.csv and wmpcand_120223_wmpid.csv, which list candidates and other people involved in the political process. The data are restricted to general election candidates and other non-candidate people of interest (sitting senators, cabinet members, international leaders, etc.).

While this repo applies the trained entity linker to the 2022 US election ads, you can also apply our entity linker to your own political ad text datasets to identify which people of interest are mentioned in ads. The entity linker is especially useful if you have a large amount of ad text and do not want to count mentions of political figures by hand. You can follow the setup instructions below to apply the entity linker to your own data.

There are separate folders for running the entity linker depending on whether you want to run it on Facebook or Google data. For both platforms, the scripts need to be run in the order of three tasks: (1) constructing a knowledge base of political entities, (2) training the entity linking model, and (3) making inferences with the trained model. The repo provides reusable code for all three tasks, and we describe each of them below. Note that we provide a knowledge base and pre-trained models that are ready for use on Google and Facebook 2022 data. For these data, you can skip steps 1 and 2 and start making inferences right away. However, if you want to apply our inference scripts to a different time period (for example, another election cycle) or a different context (for example, a non-U.S. election), you will need to create your own knowledge base and train your own models.
- **Constructing a Knowledge Base of Political Entities**

  The first task is to construct a knowledge base of political entities (people) of interest.

  The knowledge base of people of interest is constructed by `facebook/knowledge_base/01_construct_kb.R`. The input to the script is the 2022 WMP persons file person_2022.csv. The script constructs one sentence with a basic description for each person. Districts and party are sourced from the 2022 WMP candidates file wmpcand_120223_wmpid.csv, a comprehensive file with the names of candidates.

  The knowledge base has four columns: `id`, `name`, `descr` (for description), and `aliases`. Examples of aliases include Joseph R. Biden being referred to as Joe or Robert Francis O'Rourke generally being known as Beto O'Rourke. Here is an example of one row in the knowledge base:

  | id | name | descr | aliases |
  | --- | --- | --- | --- |
  | WMPID1770 | Adam Gray | Adam Gray is a Democratic candidate for the 13th District of California. | Adam Gray,Gray,Adam Gray's,Gray's,ADAM GRAY,GRAY,ADAM GRAY'S,GRAY'S |

- **Training the Entity Linking Model**

  The second task is to train an entity linking model using the knowledge base.

  Once the knowledge base of people of interest is constructed, the entity linker can be initialized with spaCy, the natural language processing library we use, in `facebook/train/02_train_entity_linking.py` (see the sketch after this list for how such a knowledge base can be loaded into spaCy).

  After successfully running the above scripts in the training folder, you should see the following trained model components in the `models` folder:

  - `intermediate_kb`
  - `trained_entity_linker`

- **Making Inferences with the Trained Model**

  The third task is to make inferences with the trained model to automatically identify and link entities mentioned in new political ad text.

  To perform this task, you can use the scripts in the inference folders, `facebook/inference` and `google/inference`. The folders include variations of the scripts that disambiguate people sharing a surname, for example, multiple "Harrises" (e.g., Kamala Harris and Andy Harris).

Note: You can skip steps 1 (Constructing a Knowledge Base of Political Entities) and 2 (Training the Entity Linking Model) if you decide to instead use our knowledge base and pre-trained entity linker model. Our knowledge base (`entity_kb.csv`) is already conveniently located within the repository, but you'll need to download the pre-trained entity linker manually. The model is hosted on our Figshare, which you can access by following this link and completing the Data Access Form. This will immediately redirect you to a page from which you can download the model.
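To make the knowledge base structure concrete, here is a minimal, illustrative sketch of how a file like `entity_kb.csv` could be loaded into a spaCy knowledge base. This is not the repo's `02_train_entity_linking.py`; the pipeline name, the `freq` value, and the duplicate-alias handling are simplifying assumptions.

```python
# Illustrative only (NOT the repo's training script): load entity_kb.csv into
# a spaCy knowledge base. Assumes spaCy 3.0-3.4; on spaCy >= 3.5 the concrete
# class is spacy.kb.InMemoryLookupKB, with the same methods.
import pandas as pd
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_md")  # assumption: any pipeline with word vectors
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=nlp.vocab.vectors_length)

entities = pd.read_csv("entity_kb.csv")  # columns: id, name, descr, aliases
for row in entities.itertuples():
    # Represent each entity by the vector of its one-sentence description.
    kb.add_entity(entity=row.id, freq=1, entity_vector=nlp(row.descr).vector)
    # Register each comma-separated alias so mentions can be resolved to WMPIDs.
    for alias in row.aliases.split(","):
        alias = alias.strip()
        if alias not in kb.get_alias_strings():
            kb.add_alias(alias=alias, entities=[row.id], probabilities=[1.0])
```

In a real setup, an alias shared by several people (e.g., "Gray") would be registered once with all of its candidate entities and their probabilities; this sketch simply keeps the first entity it sees.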
## 3. How to Run the Scripts

If you need additional technical support, here is a Terminal User Guide for macOS/Linux users and here is a Powershell User Guide for Windows users.
### 0. Cloning This Repository

In order to run the scripts in this repository, you'll first need to clone it onto your own computer. To do so:
On macOS/Linux:

- Open up your Terminal application, which is located in `Applications/Utilities` on a Mac.
- Execute the following commands to clone this repository into your home directory (the top level of your file manager, Finder):

  ```bash
  cd $HOME
  git clone https://github.com/Wesleyan-Media-Project/entity_linking_2022_usabilitystudy.git
  ```

On Windows:

- Click the Start Menu, search for the Powershell application, and select Windows Powershell to open it.
- Execute the following commands to clone this repository into your home directory (the top level of your file manager, File Explorer):

  ```powershell
  cd $HOME
  git clone https://github.com/Wesleyan-Media-Project/entity_linking_2022_usabilitystudy.git
  ```
### 1. Constructing Knowledge Base

To completely set your computer up for, as well as run, the `facebook/knowledge_base` script, you can use the `setup_kb` scripts we provide!
On macOS/Linux:

- If you haven't already, make sure you have both Python 3.10.5 and R installed on your computer. Here is a direct link to the Python 3.10.5 package for macOS, and here is a direct link to the R package for macOS. After downloading, open each package and follow the prompts. Don't forget to check the box that adds each package to your PATH during installation!

  If you need further documentation, you can visit the main Python and R sites.

- Execute the following two commands to set up and run `facebook/knowledge_base/01_construct_kb.R`:

  ```bash
  chmod +x ~/entity_linking_2022_usabilitystudy/setup_kb.sh
  ~/entity_linking_2022_usabilitystudy/setup_kb.sh
  ```

On Windows:

- If you haven't already, make sure you have both Python 3.10.5 and R installed on your computer. Here is a direct link to the Python 3.10.5 package for Windows, and here is a direct link to the R package for Windows. After downloading, open each package and follow the prompts. Don't forget to check the box that adds each package to your PATH during installation!

  If you need further documentation, you can visit the main Python and R sites.

- Execute the following two commands to set up and run `facebook/knowledge_base/01_construct_kb.R`:

  ```powershell
  Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
  ~\entity_linking_2022_usabilitystudy\setup_kb.ps1
  ```
Note: For more detailed documentation on how to manually complete this step, you can follow this link.
### 2. Training Entity Linker

To completely set your computer up for, as well as run, the `facebook/train` scripts, you can use the `setup_train` scripts we provide!

Note: Running the `02_train_entity_linking.py` script in this step takes multiple hours to complete!
- Running these scripts requires `fb_2022_adid_text.csv.gz` and `fb_2022_adid_var1.csv.gz`, which are hosted on our Figshare. If you have not downloaded these datasets yet, you can do so by following this link and completing the Data Access Form, which will redirect you to a page from which you can download both datasets. Do not move the files from your Downloads folder!

  Note: Make sure you download them as gzip files! If they don't download this way automatically and you use Safari, you may need to uncheck the option to "open 'safe' files after downloading" in your General Safari settings before trying again.

On macOS/Linux:

- Execute the following two commands to set up and run `facebook/train`:

  ```bash
  chmod +x ~/entity_linking_2022_usabilitystudy/setup_train.sh
  ~/entity_linking_2022_usabilitystudy/setup_train.sh
  ```

  Note: You may be prompted for your password. This just gives the script permission to move the datasets and the trained entity linker model from your Downloads folder to the appropriate locations. Use the same password that you use to log into your computer.

On Windows:

- Execute the following two commands to set up and run `facebook/train`:

  ```powershell
  Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
  ~\entity_linking_2022_usabilitystudy\setup_train.ps1
  ```

  Note: You may be prompted for your password. This just gives the script permission to move the datasets and the trained entity linker model from your Downloads folder to the appropriate locations. Use the same password that you use to log into your computer.
Note: For more detailed documentation on how to manually complete this step, you can follow this link.
### 3. Making Inferences

To completely set your computer up for, as well as run, the `facebook/inference` scripts, you can use the `setup_inf` scripts we provide!
- Running the `facebook/inference` scripts requires the `fb_2022_adid_text.csv.gz` dataset, and running the `google/inference` scripts requires the `g2022_adid_01062021_11082022_text.csv.gz` dataset, both of which are hosted on our Figshare. If you have not downloaded these datasets yet, you can do so by following this link and completing the Data Access Form, which will redirect you to a page from which you can download both datasets. Please do not move the files from your Downloads folder!

  Note: Again, make sure you download them as gzip files! If they don't download this way automatically and you use Safari, you may need to uncheck the option to "open 'safe' files after downloading" in your General Safari settings before trying again.

- If you skipped steps 1 (Constructing a Knowledge Base of Political Entities) and 2 (Training the Entity Linking Model), you'll need to download our pre-trained entity linker model. This model is also hosted on our Figshare, so you can access and download it through the same link as in the prior step. Please do not move the folder from your Downloads folder!

On macOS/Linux:

- If you haven't already, make sure you have both Python 3.10.5 and R installed on your computer. Here is a direct link to the Python 3.10.5 package for macOS, and here is a direct link to the R package for macOS. After downloading, open each package and follow the prompts. Don't forget to check the box that adds each package to your PATH during installation!

  If you need further documentation, you can visit the main Python and R sites.

- Execute the following two commands to set up and run `facebook/inference`:

  ```bash
  chmod +x ~/entity_linking_2022_usabilitystudy/setup_inf.sh
  ~/entity_linking_2022_usabilitystudy/setup_inf.sh
  ```

  Note: You may be prompted for your password. This just gives the script permission to move the datasets and the trained entity linker model from your Downloads folder to the appropriate locations. Use the same password that you use to log into your computer.

On Windows:

- If you haven't already, make sure you have both Python 3.10.5 and R installed on your computer. Here is a direct link to the Python 3.10.5 package for Windows, and here is a direct link to the R package for Windows. After downloading, open each package and follow the prompts. Don't forget to check the box that adds each package to your PATH during installation!

  If you need further documentation, you can visit the main Python and R sites.

- Execute the following two commands to set up and run `facebook/inference`:

  ```powershell
  Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
  ~\entity_linking_2022_usabilitystudy\setup_inf.ps1
  ```

  Note: You may be prompted for your password. This just gives the script permission to move the datasets and the trained entity linker model from your Downloads folder to the appropriate locations. Use the same password that you use to log into your computer.
Note: For more detailed documentation on how to manually complete this step, you can follow this link.
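If you would like to experiment with the trained model directly in Python, outside the provided inference scripts, the following hedged sketch shows the general idea. The model path assumes the `models` folder layout from the training step, and the ad text is the toy example used in the next section.

```python
# A hedged sketch (not the repo's inference script) of querying the trained
# entity linker directly with spaCy. The path assumes the models folder layout
# described in the training step.
import spacy

nlp = spacy.load("models/trained_entity_linker")
doc = nlp("Senator John Smith is fighting hard for Californians.")
for ent in doc.ents:
    # kb_id_ holds the linked knowledge base ID (a WMPID), or "NIL" if the
    # mention could not be resolved to an entity in the knowledge base.
    print(ent.text, ent.start_char, ent.end_char, ent.kb_id_)
```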
## 4. Results Storage

After successfully running the above scripts in the inference folder, you should see the entity linking results in the `data` folder. The data will be in `csv.gz` and `csv` format. The various Facebook results, for instance, are as follows:

- `entity_linking_results_fb22.csv.gz`: Ad ID and text field level political entity detection results. Detected entities in each textual variable (e.g., disclaimer, creative bodies, detected OCR text) are stored in a list. Each textual variable can have multiple detected entities or no detected entities. Entities are represented by their WMPIDs, which are WMP's unique identifiers for political figures.
- `entity_linking_results_fb22_notext.csv.gz`: This file drops the text column from `entity_linking_results_fb22.csv.gz` to save space (see the preview table below as an example).
- `detected_entities_fb22.csv.gz`: A compact ad ID level entity linking results file. It concatenates all detected entities (given by `entity_linking_results_fb22.csv.gz`) from all textual fields of each ad ID.
- `detected_entities_fb22_for_ad_tone.csv.gz`: Filtered entity linking results (compared to `detected_entities_fb22.csv.gz`) prepared as input for ad tone detection (a downstream classification task). It excludes detected entities from page names and disclaimers and aggregates text field level results to the ad ID level (see this script).
Here is an example of the entity linking results from `facebook/data/entity_linking_results_fb22.csv.gz`:

| text | text_detected_entities | text_start | text_end | ad_id | field |
| --- | --- | --- | --- | --- | --- |
| Senator John Smith is fighting hard for Californians. | WMPID1234 | [8] | [18] | x_1234 | ad_creative_body |
In this example,

- The `text` field contains the raw ad text where entities were detected.
- The `text_detected_entities` field contains the entities detected in the ad text, listed by their WMPID. The WMPID is the unique ID that the Wesleyan Media Project assigns to each candidate in the knowledge base (e.g., Adam Gray: WMPID1770); it links the detected entities back to the knowledge base.
- The `text_start` and `text_end` fields indicate the character offsets where the entity mention appears in the text.
- The `ad_id` field contains the unique identifier for the ad.
- The `field` field contains the field in the ad where the entity was detected. This could be, for example, `page_name`, `ad_creative_body`, or `google_asr_text` (text that we extract from video ads through Google Automatic Speech Recognition).
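As a quick example of working with these results, here is a minimal pandas sketch that counts how often each WMPID is detected. It assumes the column layout shown above; how the entity lists are serialized in the CSV is an assumption the code hedges against.

```python
# A sketch (assuming the column layout shown above) that counts how often each
# WMPID was detected in the Facebook entity linking results.
import ast
from collections import Counter

import pandas as pd

df = pd.read_csv("facebook/data/entity_linking_results_fb22.csv.gz")

def parse_entities(cell):
    # Detected entities are stored per text field as a list, which may be
    # serialized in the CSV as a string like "['WMPID1234', 'WMPID1770']".
    if isinstance(cell, str) and cell.startswith("["):
        return ast.literal_eval(cell)
    return [cell]

counts = Counter()
for cell in df["text_detected_entities"].dropna():
    counts.update(parse_entities(cell))

print(counts.most_common(10))  # the ten most frequently detected entities
```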
## 5. Results Analysis

The `csv.gz` files produced in this repo are usually large and may contain millions of rows. To make it easier to read and analyze the data, we have provided two scripts, `readcsv.py` and `readcsvGUI.py`, in the `analysis` folder of this repo.
The Python script `readcsv.py` reads and filters the `csv.gz` files and saves the filtered data to an Excel file. It has the following features:
- Load a specified number of rows from a CSV file.
- Skip a specified number of initial rows before reading the data.
- Filter rows based on the presence of a specified text (case-insensitive).
To run the script, you should be able to use the same virtual environment you created for running the main scripts. If you use a Mac, you can reactivate this environment (if necessary) with the command `source venv/bin/activate`. If you use Windows, use the command `. venv\Scripts\Activate.ps1`. After confirming that the virtual environment is activated, you can run the script with command line arguments.
For example, to run the script with the default arguments (start from row 0, read 10000 rows, no text filter), you can enter the following commands in your terminal:

```bash
cd $HOME/entity_linking_2022_usabilitystudy/
python3 analysis/readcsv.py --file facebook/data/entity_linking_results_fb22.csv.gz
```
You can customize the behavior of the script by providing additional command-line arguments:

- `--file`: Path to the CSV file (required).
- `--skiprows`: Number of rows to skip at the start of the file (default: 0).
- `--nrows`: Number of rows to read from the file (default: 10000).
- `--filter_text`: Text to filter the rows by (case-insensitive). If empty, no filtering is applied (default: no filter).
For example, to filter rows containing the text "Biden", starting from row 0 and reading 100000 rows:

```bash
python3 analysis/readcsv.py --file facebook/data/entity_linking_results_fb22.csv.gz --nrows 100000 --filter_text Biden
```
To see a help message with a description of all available arguments, you can run the following command:

```bash
python3 analysis/readcsv.py -h
```
Please note that this script may take a while (>10 min) to run, depending on the size of the data and the number of rows you request. If you ask the script to read more than 1048570 rows, the output will be saved across multiple Excel files due to the maximum number of rows an Excel sheet can handle.
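To illustrate why the output is split across files (this is a simplified sketch, not the logic of `readcsv.py` itself):

```python
# Illustration of the Excel row limit (not the repo's readcsv.py): a sheet
# holds at most 1,048,576 rows, so large results must be split across files.
import pandas as pd

ROWS_PER_FILE = 1048570  # just under Excel's per-sheet maximum

df = pd.read_csv("facebook/data/entity_linking_results_fb22.csv.gz")
for i, start in enumerate(range(0, len(df), ROWS_PER_FILE)):
    df.iloc[start:start + ROWS_PER_FILE].to_excel(
        f"entity_linking_results_part{i + 1}.xlsx", index=False
    )
```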
If you feel comfortable working with the Terminal and would like results presented in a graphical user interface, you can read instructions on how to set up and run our `analysis/readcsvGUI.py` script here.
## 6. Thank You

We would like to thank our supporters!
This material is based upon work supported by the National Science Foundation under Grant Numbers 2235006, 2235007, and 2235008.
The Cross-Platform Election Advertising Transparency Initiative (CREATIVE) is a joint infrastructure project of the Wesleyan Media Project and privacy-tech-lab at Wesleyan University in Connecticut.