
Word-embedding seed expansion and document scoring. Bring your own seeds. Originally Li, Mai, Shen, Yan (2021, RFS).


Seed-word expansion and measurement using Word2Vec (Li, Mai, Shen, Yan 2021 RFS)

Builds a corpus-specific measurement dictionary with Word2Vec. For each concept you want to measure in your corpus:

  • You provide a short seed-word list per concept.
  • The package builds a ranked dictionary of the words and multi-word phrases your corpus uses to express that concept.
  • You curate the dictionary: inspect, drop noise, add domain words.
  • The package scores every document by weighted hits against the curated dictionary.

If you find this library useful in your research, please cite:

Li, Kai, Feng Mai, Rui Shen, and Xinyan Yan (2021), "Measuring Corporate Culture Using Machine Learning," Review of Financial Studies 34(7):3265-3315, doi.org/10.1093/rfs/hhaa079.


Install

pip install -U lmsy_w2v_rfs

The default preprocessor (corenlp) needs Java and a one-time CoreNLP archive download:

pip install -U "lmsy_w2v_rfs[corenlp]"
lmsy-w2v-rfs download-corenlp                 # one-time, ~1 GB

Java-free alternatives:

pip install -U "lmsy_w2v_rfs[spacy]" && python -m spacy download en_core_web_sm
pip install -U "lmsy_w2v_rfs[stanza]"
pip install -U lmsy_w2v_rfs                   # bare; use preprocessor="static" or "none"

Quickstart

Two concepts, a few seed words each, four lines of pipeline:

from lmsy_w2v_rfs import Pipeline, Config

seeds = {
    "risk":   ["risk", "uncertainty", "volatility", "downside"],
    "growth": ["growth", "expansion", "scale", "opportunity"],
}
texts = [
    "Macro uncertainty and rising rates weighed on margins this quarter.",
    "Strong customer demand drove double-digit revenue expansion across segments.",
    "We hedged commodity exposure to limit downside from price volatility.",
    "Investments in new markets are scaling our growth opportunity.",
    # ... thousands more rows in practice
]

p = Pipeline(
    texts=texts, doc_ids=[f"d{i}" for i in range(len(texts))],
    work_dir="runs/quickstart",
    config=Config(seeds=seeds, preprocessor="none"),
)
p.run()                     # phrase + train + expand + score
p.show_dictionary(top_k=10) # inspect the expanded dictionary
print(p.score_df("TFIDF"))  # per-document scores

Output:

=== risk (12 words) ===
  seeds:    risk, uncertainty, volatility, downside
  expanded: risk, uncertainty, volatility, downside, exposure,
            commodity_exposure, rising_rates, hedge, macro_uncertainty
=== growth (14 words) ===
  seeds:    growth, expansion, scale, opportunity
  expanded: growth, expansion, scale, opportunity, customer_demand,
            new_markets, revenue_expansion, double_digit, scaling
Doc_ID  risk  growth  document_length
d0      0.41  0.00    13
d1      0.00  0.55    12
d2      0.62  0.00    12
d3      0.00  0.49    11

To reproduce the 2021 paper exactly:

from lmsy_w2v_rfs import load_example_seeds
seeds = load_example_seeds("culture_2021")    # 47 seeds, 5 dimensions

The construction procedure

The package implements the four-step construction procedure of Li et al. (2021). Each step is a method on Pipeline; calling .run() executes them in order and saves intermediate artifacts under work_dir/ so any step can be redone without redoing the others.

Step 1: Two-step phrase construction

Phrases carry meaning that single words cannot. The package extracts them in two complementary steps targeting different kinds of phrases.

Step 1a, parser-based (general-English phrases). A dependency parser identifies fixed multiword expressions (with_respect_to, rather_than) and compound words (intellectual_property, healthcare_provider). The parser also lemmatizes (stocks → stock) and masks named entities as [NER:ORG] placeholders so proper nouns do not bias the vector space. Tokens on the 120-token SRAF generic stopword list are removed in the cleaning pass that follows.

Config(preprocessor=...)               Backend                                  Needs
"corenlp" (default, paper-faithful)    Stanford CoreNLP via stanza.server       [corenlp] extra + Java
"spacy"                                spaCy                                    [spacy] extra + a model
"stanza"                               stanza Pipeline                          [stanza] extra
"static"                               NLTK MWETokenizer over a curated list    base install
"none"                                 whitespace tokenize, lowercase only      base install
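
To make the idea behind the "static" option concrete, here is a minimal pure-Python sketch of joining a curated list of multiword expressions with underscores, using greedy longest-match. It is an illustration only, not the package's code (which the table says uses NLTK's MWETokenizer); the function name join_mwes and the toy expression list are hypothetical.

```python
def join_mwes(tokens, mwes, separator="_"):
    """Join curated multiword expressions in a token list (greedy, longest match first)."""
    mwe_set = {tuple(m) for m in mwes}
    max_len = max((len(m) for m in mwe_set), default=1)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest possible expression starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in mwe_set:
                out.append(separator.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No expression starts here: keep the single token.
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical curated list; a real list would be far longer.
curated = [("intellectual", "property"), ("rather", "than"), ("with", "respect", "to")]
tokens = "we protect intellectual property rather than license it".split()
joined = join_mwes(tokens, curated)
```

The joined tokens behave like single vocabulary items in every later step, which is why phrase construction happens before Word2Vec training.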

Step 1b, statistical (corpus-specific phrases). After Step 1a, gensim's Phrases scans the parsed corpus for statistically significant adjacent-token co-occurrences and joins them with _. A second pass over the bigram-joined corpus learns trigrams. This step identifies recurring collocations specific to the corpus: an earnings-call corpus surfaces forward_looking_statement and cost_of_capital; a Glassdoor-review corpus surfaces work_life_balance and toxic_environment.

Config(
    use_gensim_phrases=True,
    phrase_passes=2,            # 1 = bigrams; 2 = bigrams + trigrams
    phrase_min_count=10,        # works on a ~270k-doc corpus; for smaller corpora try 3
    phrase_threshold=10.0,      # works on a ~270k-doc corpus; for smaller corpora try 5.0
)

The phrase-tagged corpus is written to work_dir/corpora/pass2.txt and can be opened directly to inspect the joined phrases.
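
How phrase_min_count and phrase_threshold interact can be sketched in plain Python. The example below assumes the scoring formula documented for gensim's default Phrases scorer, (count(a, b) - min_count) / (count(a) · count(b)) · vocabulary size, and simplifies the vocabulary size to the number of distinct unigrams. The function names score_bigrams and join_phrases are hypothetical, and the toy corpus is far too small for realistic settings, so the values used here are not recommendations.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1, threshold=1.0):
    """Return {(a, b): score} for adjacent token pairs whose score exceeds threshold."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)  # simplification: distinct unigrams only
    kept = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if score > threshold:
            kept[(a, b)] = score
    return kept


def join_phrases(tokens, phrases, delimiter="_"):
    """Greedily join accepted adjacent pairs, scanning left to right."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


sentences = [
    ["work", "life", "balance", "is", "good"],
    ["work", "life", "balance", "matters"],
    ["life", "is", "good"],
    ["good", "work"],
]
phrases = score_bigrams(sentences, min_count=1, threshold=0.5)
joined = join_phrases(sentences[0], phrases)
```

Running the same two functions again on the joined corpus is what a second pass amounts to: a bigram of an already-joined token and its neighbor becomes a trigram.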

Step 2: Word2Vec

Pipeline.train() fits a gensim.models.Word2Vec on the phrase-tagged corpus. Every word and phrase receives a 300-dimensional vector. Defaults match the 2021 paper:

Config(w2v_dim=300, w2v_window=5, w2v_min_count=5, w2v_epochs=20)

The model is saved at work_dir/models/w2v.mod and is available as p.w2v for ad-hoc queries.

Step 3: Seed expansion

Pipeline.expand_dictionary() builds the per-concept dictionary by:

  1. Averaging the in-vocabulary seed vectors for the concept.
  2. Taking the top n_words_dim (default 500) tokens by cosine similarity to that mean.
  3. Resolving cross-loadings: a token close to multiple concepts is assigned to the one whose seed mean it is closest to.
  4. Dropping [NER:*] placeholders so named entities never enter the dictionary.

The result is written to work_dir/outputs/expanded_dict.csv, one column per concept, sorted by descending similarity to the seed mean.

p.show_dictionary(top_k=10)         # prints per-concept seeds + top expansions
p.dictionary_preview(top_k=10)      # DataFrame for notebook display
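
The four expansion steps can be sketched on toy two-dimensional vectors. This is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, the vectors are invented, and a real run would read 300-dimensional vectors from the trained model.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_vector(vectors):
    if not vectors:
        raise ValueError("no in-vocabulary seed words for this concept")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def expand(vectors, seeds, n_words_dim=500):
    # 1. Average the in-vocabulary seed vectors for each concept.
    seed_means = {
        concept: mean_vector([vectors[w] for w in words if w in vectors])
        for concept, words in seeds.items()
    }
    # 2. Take the top n_words_dim tokens by cosine similarity to each seed mean.
    candidates = {}
    for concept, mean in seed_means.items():
        ranked = sorted(vectors, key=lambda w: cosine(vectors[w], mean), reverse=True)
        candidates[concept] = ranked[:n_words_dim]
    expanded = {concept: [] for concept in seeds}
    for concept, words in candidates.items():
        for w in words:
            # 4. Drop named-entity placeholders.
            if w.startswith("[NER:"):
                continue
            # 3. Resolve cross-loadings: keep w only under its closest concept.
            owners = [c for c in candidates if w in candidates[c]]
            best = max(owners, key=lambda c: cosine(vectors[w], seed_means[c]))
            if best == concept:
                expanded[concept].append(w)
    return expanded


# Invented toy vectors; "volatility" is deliberately out of vocabulary.
vectors = {
    "risk": [1.0, 0.0], "uncertainty": [0.9, 0.1], "exposure": [0.8, 0.3],
    "growth": [0.0, 1.0], "expansion": [0.1, 0.9], "scaling": [0.3, 0.8],
    "mixed": [0.6, 0.5], "[NER:ORG]": [0.95, 0.05],
}
seeds = {"risk": ["risk", "uncertainty", "volatility"], "growth": ["growth", "expansion"]}
expanded = expand(vectors, seeds, n_words_dim=5)
```

In this toy run "mixed" and "exposure" appear in both candidate lists and end up under "risk" only, because they are closer to the risk seed mean than to the growth seed mean.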

Step 4: Manual dictionary inspection

Nearest-neighbor expansion surfaces noise: off-topic terms, industry-specific outliers, words too general to be informative. Two ways to remove them, both atomic across the in-memory dictionary and the on-disk CSV:

# Programmatic, replicable in a notebook:
p.edit_dictionary(
    remove={"risk": ["fantastic", "build"]},
    add={"risk": ["liability"]},
)

# Spreadsheet-driven, faster on a big dictionary:
#   1. open p.dict_path in Excel or any text editor
#   2. edit, save
#   3. p.reload_dictionary()

Cached scores are dropped after curation. Call p.score() to rescore against the curated dictionary.


Scoring

A document's score on a concept is the sum of TF-IDF weights for every dictionary token present in the document, divided by total document length.

Method                              Weight per dictionary hit     Source
TFIDF                               tf · log(N/df)                2021 paper
TF                                  tf                            extension
WFIDF                               (1 + log tf) · log(N/df)      extension
TFIDF+SIMWEIGHT, WFIDF+SIMWEIGHT    × 1/ln(2 + rank)              extension

SIMWEIGHT variants additionally down-weight tokens further from the seed mean (rank in the expanded dictionary).

p.score(methods=("TFIDF",))
p.score_df("TFIDF")

Outputs land at work_dir/outputs/scores_<METHOD>.csv.
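
The weights in the table can be reproduced in a few lines of plain Python. The sketch below is not the package's scorer and rests on several assumptions that the text above does not settle: logarithms are natural, rank is zero-based within each concept's list, document length is the token count, and the division by length applies to every method. The function name score_documents is hypothetical.

```python
import math
from collections import Counter


def score_documents(docs, dictionary, method="TFIDF"):
    """docs: list of token lists; dictionary: {concept: tokens ranked by similarity}."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency of each token
    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        row = {}
        for concept, words in dictionary.items():
            total = 0.0
            for rank, w in enumerate(words):  # assumption: zero-based rank
                if tf[w] == 0:
                    continue
                idf = math.log(n_docs / df[w])
                if method.startswith("TFIDF"):
                    weight = tf[w] * idf
                elif method.startswith("WFIDF"):
                    weight = (1 + math.log(tf[w])) * idf
                elif method == "TF":
                    weight = tf[w]
                else:
                    raise ValueError(f"unknown method: {method}")
                if method.endswith("+SIMWEIGHT"):
                    weight *= 1 / math.log(2 + rank)
                total += weight
            # Normalize by document length (token count in this sketch).
            row[concept] = total / len(tokens) if tokens else 0.0
        scores.append(row)
    return scores


docs = [
    ["risk", "risk", "growth", "margin"],
    ["growth", "demand", "margin", "margin"],
    ["risk", "hedge", "margin", "cost"],
]
dictionary = {"risk": ["risk", "hedge"], "growth": ["growth", "demand"]}
tfidf = score_documents(docs, dictionary, "TFIDF")
sim = score_documents(docs, dictionary, "TFIDF+SIMWEIGHT")
```

A token that occurs in every document, such as "margin" here, has log(N/df) = 0 and would contribute nothing under TFIDF or WFIDF even if it were in the dictionary.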


Loading documents and seeds

Pipeline(texts=[...], doc_ids=[...], work_dir=..., config=cfg)              # in-memory list
Pipeline.from_csv("docs.csv", text_col="text", id_col="id", ...)            # CSV
Pipeline.from_dataframe(df, text_col="text", id_col="id", ...)              # DataFrame
Pipeline.from_directory("./docs/", pattern="*.txt", ...)                    # one file per doc
Pipeline.from_text_file("docs.txt", id_path="ids.txt", ...)                 # one doc per line
Pipeline.from_jsonl("docs.jsonl", text_key="text", id_key="id", ...)        # JSONL

Seeds accept a Python dict, a JSON file, or a plain text file:

from lmsy_w2v_rfs import load_seeds
Config(seeds=load_seeds("my_seeds.json"))     # or .txt, or pass a dict directly

CLI:

lmsy-w2v-rfs run --seeds my_seeds.txt --input docs.csv --input-format csv --out runs/x


Large corpora

Once parsing finishes, downstream stages stream through disk: clean reads parsed sentences line by line; phrase and train use gensim's PathLineSentences so the training corpus is never fully materialized. The bottleneck is the input stage: the document loader holds the corpus in a Python list before parsing begins.

A scaling-friendly input format is one document per line in a single file, with an optional parallel IDs file.

from lmsy_w2v_rfs import Pipeline, Config, load_seeds

# transcripts.txt: one document per line. transcript_ids.txt: matching IDs.
p = Pipeline.from_text_file(
    "/data/transcripts.txt",
    id_path="/data/transcript_ids.txt",
    work_dir="runs/big",
    config=Config(
        seeds=load_seeds("my_seeds.txt"),
        preprocessor="spacy",                   # or "corenlp" for paper-faithful
        n_cores=8,
    ),
)
p.run()

If id_path is omitted, IDs are generated as line numbers ("0", "1", ...).

For corpora that exceed RAM (tens of millions of long documents), orchestrate the stages manually:

  1. Pre-split transcripts.txt into shards (split -l 100000 transcripts.txt shard_).
  2. Run parse separately per shard.
  3. Concatenate each shard's parsed/sentences.txt and parsed/sentence_ids.txt into a merged work_dir.
  4. Run clean, phrase, train, expand_dictionary, and score once on the merged corpus.

The stage methods read from disk and are idempotent, so this kind of manual orchestration works without subclassing.
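
Where the split utility is not available, the sharding and concatenation can be done in plain Python. The sketch below covers only the file handling; it does not call the package, and the helper names split_into_shards and concatenate as well as the shard file names are hypothetical. The demo runs on a small temporary file so it is safe to execute as is.

```python
import shutil
import tempfile
from pathlib import Path


def split_into_shards(text_path, out_dir, lines_per_shard=100_000):
    """Split a one-document-per-line file into numbered shard files, streaming."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    shard_file = None
    with open(text_path, encoding="utf-8") as src:
        for line_no, line in enumerate(src):
            if line_no % lines_per_shard == 0:
                if shard_file is not None:
                    shard_file.close()
                path = out_dir / f"shard_{len(shard_paths):05d}.txt"
                shard_paths.append(path)
                shard_file = open(path, "w", encoding="utf-8")
            shard_file.write(line)
    if shard_file is not None:
        shard_file.close()
    return shard_paths


def concatenate(paths, out_path):
    """Append files in order; assumes every input ends with a newline."""
    with open(out_path, "wb") as dst:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, dst)


# Demo on a tiny temporary corpus: 7 documents, 3 per shard.
tmp = Path(tempfile.mkdtemp())
corpus = tmp / "transcripts.txt"
corpus.write_text("".join(f"document {i}\n" for i in range(7)), encoding="utf-8")
shards = split_into_shards(corpus, tmp / "shards", lines_per_shard=3)
merged = tmp / "merged.txt"
concatenate(shards, merged)
```

An IDs file can be split with the same function and the same lines_per_shard, which keeps each shard's documents and IDs aligned line by line. The same concatenate helper applies to the per-shard parsed files, as long as shards are passed in their original order.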

For smaller corpora (a few thousand to a hundred thousand documents), Pipeline.from_directory(...), Pipeline.from_csv(...), and Pipeline.from_dataframe(...) are convenient. They all materialize the corpus in memory at construction.


All knobs

Config(
    seeds=...,                         # required: dict[str, list[str]]

    # Step 1a
    preprocessor="corenlp",            # "corenlp" | "spacy" | "stanza" | "static" | "none"
    mwe_list=None,                     # None | "finance" | path to a curated list
    spacy_model="en_core_web_sm",
    n_cores=4,
    corenlp_memory="6G",
    corenlp_port=9002,

    # Step 1b
    use_gensim_phrases=True,
    phrase_passes=2,
    phrase_threshold=10.0,
    phrase_min_count=10,

    # Step 2
    w2v_dim=300,
    w2v_window=5,
    w2v_min_count=5,
    w2v_epochs=20,

    # Step 3
    n_words_dim=500,                   # paper's threshold for the dictionary cutoff
    dict_restrict_vocab=None,
    min_similarity=0.0,

    # Scoring (extensions beyond the 2021 paper)
    tfidf_normalize=False,
    zca_whiten=False,                  # ZCA-decorrelate the concept columns
    zca_epsilon=1e-6,

    random_state=42,
)

Citation

@article{li2021measuring,
  title={Measuring Corporate Culture Using Machine Learning},
  author={Li, Kai and Mai, Feng and Shen, Rui and Yan, Xinyan},
  journal={The Review of Financial Studies},
  volume={34}, number={7}, pages={3265--3315}, year={2021},
  doi={10.1093/rfs/hhaa079}
}

License

MIT.
