Lexical sentiment analysis pipeline for central bank and economic text data.

AutoEconSentiment

A streamlined, production-ready pipeline for extracting and analyzing economic sentiment from textual data — focused on high-performance lexical sentiment analysis using established central bank and financial dictionaries.


Q. Quick Start

Q.1 Install

Ensure you have uv installed, then synchronize the environment:

uv sync

Q.2 Run on Your Own Data (Python API)

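from auto_econ_sentiment.pipeline import AutoEconSentiment

# Point the pipeline at an input file and name its text and date columns.
analyzer = AutoEconSentiment(
    import_file_path="data/raw/basic_tests/monetary_policy_statement.parquet.gzip",
    text_column="text",
    date_column="date",
    export_path="data/sentiment/basic_tests/"
)

# Clean the text, score it against the selected dictionaries, and export the results.
analyzer.run(
    # Tokenize and stem so that both unstemmed and stemmed dictionaries can be applied.
    clean_config={"tokenize": True, "stem": True},
    dictionaries={"unstemmed": ["correa", "hubert", "lm", "hiv"], "stemmed": ["ap", "bn"]},
    aggregation_methods=["posneg", "allwords"],
    export_results=True
)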

Q.3 Run from YAML Config

Configure inputs, cleaning rules, and dictionaries in params.yaml, then run:

uv run python -m src.auto_econ_sentiment.pipeline

Q.4 Run the CBS Speeches Demo

Download ~35K central bank speeches and run sentiment analysis across all 143 central banks:

# 1. Download the CBS dataset and split by central bank
uv run python -m src.data.cb_speeches_download

# 2. Run the sentiment pipeline over all banks
uv run python -m src.data.cb_speeches_clean

Then open notebooks/demo_cb_speechs.ipynb to explore the results interactively.


1. Features

  • Robust Text Cleaning: Handles HTML stripping, unicode normalization, special character encodings, percent/number normalization, configurable header removal, tokenization, and Porter stemming.
  • Lexical Sentiment Analysis: Computes document-level positive/negative word counts and sentiment scores across 6 established dictionaries with 2 aggregation methods (posneg, allwords).
  • YAML-Driven Configuration: All pipeline parameters (input paths, cleaning rules, dictionaries) are managed through params.yaml — no hard-coded values.
  • CBS Speeches Demo: End-to-end demonstration on ~35K central bank speeches from 143 central banks (1986-2023).

2. Data

2.1 Input Data (data/raw/)

Contains immutable, original input data. Never modified directly. Note: Raw datasets are large and excluded from version control (.gitignore). You must download or generate them locally using the scripts in src/data/.

Path | Description
--- | ---
data/raw/basic_tests/monetary_policy_statement.parquet.gzip | FOMC monetary policy statements. Used as the primary test and demo dataset for params.yaml and --test mode. Columns: text, date.
data/raw/basic_tests/statements_speeches.parquet.gzip | A small mixed sample of central bank statements and speeches. Used for quick pipeline validation.
data/raw/speeches/CBNAME.parquet.gzip | The full CBS Central Bank Speeches Dataset (~35K speeches, 143 central banks, 1986-2023), split into one file per central bank. Generated by src/data/cb_speeches_download.py. Columns: URL, PDF, Title, Subtitle, Date, Authorname, Role, Gender, CentralBank, Country, text, text_original, Filename, Language, Source.

2.2 Sentiment Outputs (data/sentiment/)

Contains all outputs generated by the AutoEconSentiment pipeline.

Path | Description
--- | ---
basic_tests/cleaned.parquet.gzip | Cleaned and tokenized text from the basic test dataset.
basic_tests/sentiment_all_results.csv | Combined sentiment results for the basic test dataset across all dictionaries and methods.
cb_speeches/CBNAME/cleaned.parquet.gzip | Cleaned and tokenized speeches for each central bank.
cb_speeches/CBNAME/sentiment_all_results.csv | Final sentiment scores per speech for each central bank, with columns for each {dictionary}_{method}_sentiment combination.
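
As a rough illustration of the {dictionary}_{method}_sentiment naming pattern, the sketch below enumerates the column names that would result if every one of the six dictionaries were scored with both aggregation methods. The helper name expected_sentiment_columns is hypothetical, and the actual column order and set in sentiment_all_results.csv depend on the dictionaries and methods selected for a given run.

```python
from itertools import product

# Dictionary and method names as listed in this README.
DICTIONARIES = ["hubert", "lm", "hiv", "correa", "bn", "ap"]
METHODS = ["posneg", "allwords"]


def expected_sentiment_columns(dictionaries=DICTIONARIES, methods=METHODS):
    """Hypothetical helper: build one column name per dictionary/method pair."""
    return [f"{d}_{m}_sentiment" for d, m in product(dictionaries, methods)]


columns = expected_sentiment_columns()
print(len(columns))   # 6 dictionaries x 2 methods
print(columns[:2])
```

A list like this can be handy for checking that an exported CSV contains the columns a downstream notebook expects.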

2.3 Configuration (references/configs/)

File | Description
--- | ---
params.yaml | Main pipeline configuration for the basic FOMC test dataset.
references/configs/params_cb_speeches.yaml | Pipeline configuration for the CBS central bank speeches demo.

3. Library Components (src/auto_econ_sentiment/)

3.1 pipeline.py — Main Orchestrator

The AutoEconSentiment class is the primary entry point. It orchestrates loading, cleaning, and sentiment analysis via its run() method. Accepts import_file_path, text_column, date_column, and export_path. Can also be invoked from the command line with --test for a built-in synthetic data run.

3.2 clean/text_loader.py — Data Loader

TextLoader handles loading input data from csv, parquet, and parquet.gzip formats. Validates that the required text_column and date_column are present and returns a clean copy of the DataFrame.

3.3 clean/text_clean.py — Text Cleaner

TextCleaner applies a configurable multi-step cleaning pipeline:

  • HTML stripping and unicode normalization
  • British-to-American English conversion (clean/references/british_2_american.py)
  • Number and percentage normalization
  • Configurable header/boilerplate removal
  • Word tokenization (splits text into token lists)
  • Porter stemming (reduces tokens to root forms for stemmed dictionaries)

Cleaned text is assigned a unique id_text for downstream joining.
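
To make the cleaning steps more concrete, here is a minimal standard-library sketch of HTML stripping, unicode normalization, percentage normalization, and tokenization. It is not the TextCleaner implementation: the function names, the regular expressions, the lowercasing step, and the choice to rewrite "2.5%" as "2.5 percent" are all assumptions made for illustration. British-to-American conversion, header removal, and stemming are left out; Porter stemming would typically come from a library such as NLTK.

```python
import html
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Illustrative cleaning pass (assumed rules, not the package's own)."""
    text = re.sub(r"<[^>]+>", " ", raw)              # drop HTML tags
    text = html.unescape(text)                       # decode entities such as &nbsp;
    text = unicodedata.normalize("NFKC", text)       # fold compatibility characters
    text = re.sub(r"(\d)\s*%", r"\1 percent", text)  # assumed percent normalization
    text = re.sub(r"\s+", " ", text).strip()         # collapse whitespace
    return text.lower()


def tokenize(text: str) -> list[str]:
    """Split cleaned text into word and number tokens (decimals kept whole)."""
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text)


cleaned = clean_text("<p>Inflation rose&nbsp;to 2.5% in&nbsp;June.</p>")
tokens = tokenize(cleaned)
print(cleaned)
print(tokens)
```

Keeping each step as a single substitution makes it straightforward to switch steps on or off from a configuration dictionary such as clean_config.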

3.4 models/sentiment_lexical.py — Lexical Sentiment Model

SentimentLexical performs bag-of-words sentiment scoring against a vocabulary loaded from the master YAML dictionary (data/lexical_master_dict.yaml). For each dictionary, it counts positive and negative word occurrences and computes a sentiment score using one of two methods:

  • posneg: 1 + (pos - neg) / (pos + neg) — normalized to the sentiment words only.
  • allwords: 1 + (pos - neg) / total_tokens — normalized to all words in the document.
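
The two formulas above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the SentimentLexical code: the function names and the toy word lists are hypothetical, and returning None when the denominator is zero is an assumption, since the README does not say how that case is handled.

```python
def count_matches(tokens, positive, negative):
    """Count how many tokens appear in the positive and negative word sets."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    return pos, neg


def score_posneg(pos, neg):
    """1 + (pos - neg) / (pos + neg); None if no sentiment words matched (assumption)."""
    if pos + neg == 0:
        return None
    return 1 + (pos - neg) / (pos + neg)


def score_allwords(pos, neg, total_tokens):
    """1 + (pos - neg) / total_tokens; None for an empty document (assumption)."""
    if total_tokens == 0:
        return None
    return 1 + (pos - neg) / total_tokens


# Toy word lists for illustration only; they are not taken from any real dictionary.
tokens = ["growth", "remains", "strong", "but", "risks", "persist"]
pos, neg = count_matches(tokens, positive={"growth", "strong"}, negative={"risks"})

posneg = score_posneg(pos, neg)                   # 1 + (2 - 1) / 3
allwords = score_allwords(pos, neg, len(tokens))  # 1 + (2 - 1) / 6
print(pos, neg, round(posneg, 4), round(allwords, 4))
```

Because (pos - neg) / (pos + neg) lies between -1 and 1, the posneg score falls in the range 0 to 2, with 1 indicating a balance of positive and negative words. The allwords score is centered on 1 in the same way but stays closer to it, since its denominator counts every token.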

3.5 models/sentiment_base.py — Abstract Base

SentimentBase is the abstract base class for sentiment models, providing shared input DataFrame handling and the text_column interface.

3.6 data/lexical_master_dict.yaml — Dictionary Definitions

Master YAML file containing the positive/negative word lists for all 6 supported dictionaries: hubert, lm, hiv, correa, bn, ap.

3.7 exceptions.py — Custom Exceptions

Defines DataLoadError and SentimentAnalysisError for structured error handling throughout the pipeline.
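
As a sketch of how such exceptions might be used, the example below defines stand-in classes with the same names and raises DataLoadError for an unsupported file format or a missing column, mirroring the loader behavior described in section 3.2. The class bodies, the check_input helper, and the error messages are assumptions for illustration; the package's own definitions may differ.

```python
from pathlib import Path


# Stand-in definitions for illustration; the real classes live in exceptions.py.
class DataLoadError(Exception):
    """Raised when input data cannot be loaded or validated."""


class SentimentAnalysisError(Exception):
    """Raised when sentiment scoring fails."""


# Formats listed for TextLoader in this README.
SUPPORTED_SUFFIXES = (".csv", ".parquet", ".parquet.gzip")


def check_input(path, columns, text_column, date_column):
    """Hypothetical pre-load check: validate the file suffix and required columns."""
    name = Path(path).name
    if not name.endswith(SUPPORTED_SUFFIXES):
        raise DataLoadError(f"Unsupported file format: {name}")
    missing = [c for c in (text_column, date_column) if c not in columns]
    if missing:
        raise DataLoadError(f"Missing required column(s): {missing}")


try:
    check_input("speeches.xlsx", ["text", "date"], "text", "date")
except DataLoadError as err:
    print(f"Load failed: {err}")
```

Catching a dedicated exception type lets calling code distinguish input problems from scoring problems without inspecting error messages.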

3.8 utils/load_yaml.py — YAML Config Loader

load_yaml_config() loads and validates pipeline configuration from a YAML file using yaml.safe_load().

3.9 utils/paths.py — Path Utilities

Shared path resolution helpers.

3.10 clean/text_viz.py — Cleaning Visualizer

Utilities for visualizing text before and after cleaning (for exploratory and debugging use).


4. Tests (tests/)

The test suite is in tests/test_pipeline.py and covers the full pipeline from data loading to sentiment output. Run with:

uv run pytest

Test | Description
--- | ---
test_loader_synthetic_csv | Verifies TextLoader correctly loads a synthetic CSV.
test_loader_missing_column | Confirms an error is raised when required columns are absent.
test_loader_unsupported_format | Confirms an error is raised for unsupported file types.
test_loader_returns_copy | Verifies the loader returns a defensive copy.
test_cleaner_basic_run_on_fomc | Runs TextCleaner on real FOMC data and validates output shape.
test_cleaner_header_removal | Verifies boilerplate header strings are removed.
test_cleaner_tokenize_fomc | Checks tokenized output is a non-empty list of strings.
test_cleaner_stem_fomc | Confirms stemming reduces tokens to root forms.
test_cleaner_percentage_normalization | Verifies percentages are normalized correctly.
test_cleaner_assigns_id_text | Confirms each row receives a unique id_text identifier.
test_cleaner_missing_column | Confirms a clear error when the text column is missing.
test_sentiment_hubert_posneg | Runs Hubert dictionary with posneg method and checks score range.
test_sentiment_lm_posneg | Runs LM dictionary with posneg method.
test_sentiment_correa_allwords | Runs Correa dictionary with allwords method.
test_sentiment_text_column_override | Verifies overriding the text column does not mutate the original DataFrame.
test_sentiment_unknown_dictionary | Confirms a clear error for unknown dictionary names.
test_sentiment_word_counts_nonzero | Verifies that matched sentiment word counts are > 0 on real data.
test_public_api_imports | Confirms the public API imports correctly from the package.
test_version_is_string | Verifies __version__ is a valid string.

5. Citations

Lexical Dictionaries

  • Loughran-McDonald (LM): Loughran, T. and B. McDonald (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance 66, 35–65.
  • Correa: Correa, R., K. Garud, J. Londono, and N. Mislang (2017). Sentiment in Central Banks' Financial Stability Reports. Board of Governors of the Federal Reserve System Research Series. International Finance Discussion Paper 1203.
  • Hubert: Hubert, P. and F. Labondance (2021). The signaling effects of central bank tone. European Economic Review 133, 103684.
  • General Inquirer (HIV):
    • Stone, Philip J., Dexter C. Dunphy, and Marshall S. Smith. "The general inquirer: A computer approach to content analysis." (1966).
    • Lasswell, Harold Dwight, and Nathan Constantin Leites. "Language of politics: Studies in quantitative semantics." (1966).
  • Apel-Blix Grimaldi (AP): Apel, M. and M. Blix Grimaldi (2014). How Informative Are Central Bank Minutes? Review of Economics 65(1), 53-76.
  • Bennani-Neuenkirch (BN): Bennani, H. and M. Neuenkirch (2017). The (Home) Bias of European Central Bankers: New Evidence Based on Speeches. Applied Economics 49(11), 1114-1131.

6. Scripts and Notebooks

6.1 src/data/ (Data Pipelines & Orchestration)

The src/data/ folder contains standalone scripts that fetch external datasets and run the sentiment pipeline over them. Each script is its own execution entry point.

  • cb_speeches_download.py: Ingests the central bank speeches dataset from cbsspeeches.org and partitions the data into data/raw/.
  • cb_speeches_clean.py: Orchestrates the AutoEconSentiment pipeline specifically for the CBS speeches dataset, producing the sentiment outputs locally.

6.2 notebooks/ (Exploration & Demos)

The notebooks/ folder contains exploratory data analysis (EDA) and demonstration Jupyter Notebooks. These notebooks consume the processed data generated by the src/data/ orchestration scripts.

  • autoecon_demo.ipynb: A general walkthrough demo.
  • demo_cb_speechs.ipynb: An interactive output visualization notebook showcasing results processed by cb_speeches_clean.py.
