CREATIVE --- datasets

Welcome! This repo contains datasets for political ad classification on social media.

This repo is part of the Cross-platform Election Advertising Transparency Initiative (CREATIVE). CREATIVE is an academic research project that aims to provide the public with analysis tools for greater transparency of political ads across online platforms. In particular, CREATIVE provides cross-platform integration and standardization of political ads collected from Google and Facebook. CREATIVE is a joint project of the Wesleyan Media Project (WMP) and the privacy-tech-lab at Wesleyan University.

To analyze the different dimensions of political ad transparency, we have developed an analysis pipeline. This repo is part of the Data Processing step of that pipeline.

1. Introduction

This repo stores datasets that are used as inputs to classifiers and other scripts of the CREATIVE project. Some repos use TV data, which is not included in this repo due to contractual restrictions. To access the TV datasets, apply directly by filling out the online request form at http://mediaproject.wesleyan.edu/dataaccess/. The creation of the datasets in this repo is not assumed to be replicable.

The datasets in this repo are utilized in the following CREATIVE repos:

2. Data

Most of the data in this repo is in CSV format.

2.1 `candidates`

The candidates folder mostly contains information about political candidates:

Files whose names start with wmpcand contain candidate characteristics collected by WMP in partnership with OpenSecrets. Each candidate has a unique identifier created and used by WMP. Because candidates can run for different offices within and across cycles, they often have multiple Federal Election Commission (FEC) identifiers, so WMP uses its own identifier to distinguish individuals.

opponents_2022.R contains information on political candidates and their opponents for elections held in 2022, which is needed to compute ad tone based on whether an ad mentions an opponent.

Both face_url_politician.csv and face_url_candidate.csv contain a face_url for each candidate that can be used for face recognition.

As noted above, some candidates have multiple FEC identifiers (fecids). corrections_fecids.csv handles this. For any fecid in the first column, the second column specifies what it should be changed to. For example, if a candidate has three fecids and the first is the correct one, the file contains two rows for that candidate: one mapping 2 -> 1 and one mapping 3 -> 1. The third column is purely cosmetic and only makes it easier to remember who is who.
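
The fecid correction described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own code: it assumes the file has a header row, reads the columns by position because the actual column names are not documented here, and uses made-up fecids and names in the sample.

```python
import csv
import io


def load_corrections(csv_file):
    """Build {fecid_to_replace: corrected_fecid} from the first two columns."""
    reader = csv.reader(csv_file)
    next(reader, None)  # assumption: the first row is a header
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def correct_fecid(fecid, corrections):
    """Return the corrected fecid, or the input unchanged if no correction exists."""
    return corrections.get(fecid, fecid)


# Made-up sample in the three-column layout described above
# (header names and identifiers are hypothetical).
sample = io.StringIO(
    "fecid,corrected_fecid,name\n"
    "H0XX00002,H0XX00001,Jane Doe\n"
    "S0XX00003,H0XX00001,Jane Doe\n"
)
corrections = load_corrections(sample)
print(correct_fecid("S0XX00003", corrections))
```

To run this against the real file, pass an open handle to candidates/corrections_fecids.csv instead of the in-memory sample. A fecid that has no row in the file is returned as-is.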

2.2 `facebook`

The facebook folder contains a range of datasets used in the Facebook ads collection. asr_fb2022_0905_1108.csv is the dataset for automatic speech recognition (ASR). It contains each ad's unique id, its location on the WesMedia server, the SHA256 checksum of the video file, ASR status, models, media types, etc.
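
Since the ASR dataset records a SHA256 checksum for each video file, a local copy of a video could be checked against it roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and the expected checksum would come from whichever column of the CSV holds it (the column name is not documented here).

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare a file's digest with the checksum stored in the dataset."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat for large video files.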

2.3 `people`

The people folder contains information, collected by WMP, about the candidates who appear in the ads. The datasets include fields such as the candidate's unique id (wmpid), full name (full_name), FEC identifiers for different years or campaign cycles (fecid_2020, fecid_2022a, fecid_2022b, fecid_2022old), the date the candidate was added to the WMP person-level file (dateadded_person), etc.
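
Because one person can carry several FEC identifiers across these columns, a common need is to look up the wmpid for any given fecid. The sketch below shows one way to build such an index. The column names come from the field list above; the sample rows and the function name are made up for illustration.

```python
import csv
import io

# FEC identifier columns named in the field list above.
FECID_COLUMNS = ["fecid_2020", "fecid_2022a", "fecid_2022b", "fecid_2022old"]


def build_fecid_index(rows):
    """Map every non-empty fecid in any cycle column to the person's wmpid."""
    index = {}
    for row in rows:
        for col in FECID_COLUMNS:
            fecid = (row.get(col) or "").strip()
            if fecid:
                index[fecid] = row["wmpid"]
    return index


# Made-up sample rows (identifiers and names are hypothetical).
sample = io.StringIO(
    "wmpid,full_name,fecid_2020,fecid_2022a,fecid_2022b,fecid_2022old\n"
    "WMPID1,Jane Doe,H0XX00001,H0XX00001,S2XX00009,\n"
    "WMPID2,John Roe,,H2XX00005,,\n"
)
index = build_fecid_index(csv.DictReader(sample))
print(index["S2XX00009"])
```

Empty cells are skipped, so a person with only one known fecid still gets a single clean entry.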

2.4 `wmp_entity_files`

The wmp_entity_files folder contains entity files, which hold information about each advertiser and are created by WMP in partnership with OpenSecrets. These files match advertiser information to other sources of data, such as FEC identifiers, and add important contextual information, such as whether the advertiser is a candidate, a party, an outside group, or unknown. The datasets in this folder are divided into Google entity files and Facebook entity files.

2.5 `google`

The google folder contains Google 2020 data in the files google_2020_adid_06102022.csv.gz, google_2020_adid_text_clean.csv.gz and google_2020_adid_var1.csv.gz. These files are preserved mainly for internal use because legacy scripts depend on them.
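
These files are gzip-compressed CSVs, so they can be read without unpacking them first. The following is a minimal sketch using only the Python standard library. The function name is hypothetical and UTF-8 encoding is an assumption, as is the presence of a header row.

```python
import csv
import gzip


def read_header_and_rows(path, limit=5):
    """Return the header and up to `limit` data rows from a .csv.gz file."""
    # Assumption: the file is UTF-8 encoded and its first row is a header.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = [row for _, row in zip(range(limit), reader)]
    return header, rows
```

For example, calling read_header_and_rows("google/google_2020_adid_text_clean.csv.gz") from the repo root would return the column names and the first few rows, which is a quick way to inspect a file before loading it in full.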

3. Thank You

We would like to thank our supporters!

This material is based upon work supported by the National Science Foundation under Grant Numbers 2235006, 2235007, and 2235008.

The Cross-Platform Election Advertising Transparency Initiative (CREATIVE) is a joint infrastructure project of the Wesleyan Media Project and privacy-tech-lab at Wesleyan University in Connecticut.
