Word-embedding seed expansion and document scoring. Bring your own seeds. Originally Li, Mai, Shen, Yan (2021, RFS).
Seed words expansion and measurements using Word2Vec (Li, Mai, Shen, Yan 2021 RFS)
Builds a corpus-specific measurement dictionary with Word2Vec. For each concept you want to measure in your corpus:
- You provide a short seed-word list per concept.
- The package builds a ranked dictionary of the words and multi-word phrases your corpus uses to express that concept.
- You curate the dictionary: inspect, drop noise, add domain words.
- The package scores every document by weighted hits against the curated dictionary.
If you find this library useful in your research, please cite:
Li, Kai, Feng Mai, Rui Shen, and Xinyan Yan (2021), "Measuring Corporate Culture Using Machine Learning," Review of Financial Studies 34(7):3265-3315, doi.org/10.1093/rfs/hhaa079.
Install
pip install -U lmsy_w2v_rfs
The default preprocessor (corenlp) needs Java and a one-time CoreNLP archive download:
pip install -U "lmsy_w2v_rfs[corenlp]"
lmsy-w2v-rfs download-corenlp # one-time, ~1 GB
Java-free alternatives:
pip install -U "lmsy_w2v_rfs[spacy]" && python -m spacy download en_core_web_sm
pip install -U "lmsy_w2v_rfs[stanza]"
pip install -U lmsy_w2v_rfs # bare; use preprocessor="static" or "none"
Quickstart
Two concepts, a few seed words each, four lines of pipeline:
from lmsy_w2v_rfs import Pipeline, Config
seeds = {
"risk": ["risk", "uncertainty", "volatility", "downside"],
"growth": ["growth", "expansion", "scale", "opportunity"],
}
texts = [
"Macro uncertainty and rising rates weighed on margins this quarter.",
"Strong customer demand drove double-digit revenue expansion across segments.",
"We hedged commodity exposure to limit downside from price volatility.",
"Investments in new markets are scaling our growth opportunity.",
# ... thousands more rows in practice
]
p = Pipeline(
texts=texts, doc_ids=[f"d{i}" for i in range(len(texts))],
work_dir="runs/quickstart",
config=Config(seeds=seeds, preprocessor="none"),
)
p.run() # phrase + train + expand + score
p.show_dictionary(top_k=10) # inspect the expanded dictionary
print(p.score_df("TFIDF")) # per-document scores
=== risk (12 words) ===
seeds: risk, uncertainty, volatility, downside
expanded: risk, uncertainty, volatility, downside, exposure,
commodity_exposure, rising_rates, hedge, macro_uncertainty
=== growth (14 words) ===
seeds: growth, expansion, scale, opportunity
expanded: growth, expansion, scale, opportunity, customer_demand,
new_markets, revenue_expansion, double_digit, scaling
| Doc_ID | risk | growth | document_length |
|---|---|---|---|
| d0 | 0.41 | 0.00 | 13 |
| d1 | 0.00 | 0.55 | 12 |
| d2 | 0.62 | 0.00 | 12 |
| d3 | 0.00 | 0.49 | 11 |
To reproduce the 2021 paper exactly:
from lmsy_w2v_rfs import load_example_seeds
seeds = load_example_seeds("culture_2021") # 47 seeds, 5 dimensions
The construction procedure
The package implements the four-step construction procedure of Li et al. (2021). Each step is a method on Pipeline; calling .run() executes them in order and saves intermediate artifacts under work_dir/ so any step can be redone without redoing the others.
Step 1: Two-step phrase construction
Phrases carry meaning that single words cannot. The package extracts them in two complementary steps targeting different kinds of phrases.
Step 1a, parser-based (general-English phrases). A dependency parser identifies fixed multiword expressions (with_respect_to, rather_than) and compound words (intellectual_property, healthcare_provider). The parser also lemmatizes (stocks → stock) and masks named entities as [NER:ORG] placeholders so proper nouns do not bias the vector space. The 120-token SRAF generic stopword list is removed in the cleaning pass that follows.
| Config(preprocessor=...) | Backend | Needs |
|---|---|---|
| "corenlp" (default, paper-faithful) | Stanford CoreNLP via stanza.server | [corenlp] extra + Java |
| "spacy" | spaCy | [spacy] extra + a model |
| "stanza" | stanza Pipeline | [stanza] extra |
| "static" | NLTK MWETokenizer over a curated list | base install |
| "none" | whitespace tokenize, lowercase only | base install |
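To make the two base-install rows concrete, here is a minimal sketch of what curated-list phrase joining can look like: lowercase, split on whitespace, then greedily merge the longest multiword expression found at each position. The function `join_mwes` and the list `CURATED` are hypothetical names for illustration; this is not the package's code, and NLTK's MWETokenizer may differ in detail.

```python
def join_mwes(text, mwes, sep="_"):
    """Lowercase, whitespace-tokenize, then join the longest curated MWE at each position."""
    tokens = text.lower().split()
    mwe_set = {tuple(m) for m in mwes}
    max_len = max((len(m) for m in mwe_set), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so "with respect to" beats any shorter overlap.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in mwe_set:
                out.append(sep.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword match starts here: keep the single token.
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical curated list; a real list would be much longer.
CURATED = [("with", "respect", "to"), ("rather", "than"), ("intellectual", "property")]

tokens = join_mwes("We license intellectual property rather than build it", CURATED)
print(tokens)
# ['we', 'license', 'intellectual_property', 'rather_than', 'build', 'it']
```

Unlike the parser-based backends, a list-driven approach like this does no lemmatization or entity masking, which is the trade-off for avoiding the extra dependencies.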
Step 1b, statistical (corpus-specific phrases). After Step 1a, gensim's Phrases scans the parsed corpus for statistically significant adjacent-token co-occurrences and joins them with _. A second pass over the bigram-joined corpus learns trigrams. This step identifies recurring collocations specific to the corpus: an earnings-call corpus surfaces forward_looking_statement and cost_of_capital; a Glassdoor-review corpus surfaces work_life_balance and toxic_environment.
Config(
use_gensim_phrases=True,
phrase_passes=2, # 1 = bigrams; 2 = bigrams + trigrams
phrase_min_count=10, # works on a ~270k-doc corpus
phrase_threshold=10.0, # for smaller corpora try 3 / 5.0
)
The phrase-tagged corpus is written to work_dir/corpora/pass2.txt and can be opened directly to inspect the joined phrases.
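For intuition about how phrase_min_count and phrase_threshold interact, the sketch below re-implements the default bigram score used by gensim's Phrases, (count(a, b) - min_count) × vocab_size / (count(a) × count(b)), in plain Python. The function `score_bigrams` is a hypothetical name, and the vocabulary size is simplified here to the number of distinct unigrams, so the exact numbers will not match gensim's own bookkeeping.

```python
from collections import Counter


def score_bigrams(sentences, min_count, threshold):
    """Return {(a, b): score} for adjacent pairs whose score exceeds the threshold."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # simplifying assumption: distinct unigrams only
    accepted = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            accepted[(a, b)] = score
    return accepted


sentences = [
    ["cost", "of", "capital", "rose"],
    ["cost", "of", "capital", "fell"],
    ["the", "cost", "rose"],
    ["capital", "markets", "fell"],
]

# Toy settings; the defaults above (10 / 10.0) are meant for far larger corpora.
phrases = score_bigrams(sentences, min_count=1, threshold=0.5)
for pair, score in sorted(phrases.items()):
    print("_".join(pair), round(score, 3))
```

A pair seen no more than min_count times scores zero or below, so raising min_count prunes rare pairs, while raising the threshold demands that the pair co-occur more often than its parts would suggest. Running a second pass over the joined output is what allows cost_of and capital to merge into a trigram.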
Step 2: Word2Vec
Pipeline.train() fits a gensim.models.Word2Vec on the phrase-tagged corpus. Every word and phrase that occurs at least w2v_min_count times receives a 300-dimensional vector. Defaults match the 2021 paper:
Config(w2v_dim=300, w2v_window=5, w2v_min_count=5, w2v_epochs=20)
The model is saved at work_dir/models/w2v.mod and is available as p.w2v for ad-hoc queries.
Step 3: Seed expansion
Pipeline.expand_dictionary() builds the per-concept dictionary by:
- Averaging the in-vocabulary seed vectors for the concept.
- Taking the top n_words_dim (default 500) tokens by cosine similarity to that mean.
- Resolving cross-loadings: a token close to multiple concepts is assigned to the one whose seed mean it is closest to.
- Dropping [NER:*] placeholders so named entities never enter the dictionary.
The result is written to work_dir/outputs/expanded_dict.csv, one column per concept, sorted by descending similarity to the seed mean.
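The expansion steps can be sketched in plain Python on toy two-dimensional vectors. Everything here (`expand`, `cosine`, `mean_vector`, the vectors themselves) is hypothetical and only illustrates the logic described above; the package works on the trained Word2Vec model, and details such as the order of the filtering steps are assumptions.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def expand(vectors, seeds, top_n):
    """Return {concept: [tokens]} sorted by descending similarity to the seed mean."""
    # Average only the seeds that actually have a vector.
    means = {c: mean_vector([vectors[w] for w in ws if w in vectors])
             for c, ws in seeds.items()}
    best = {}  # token -> (similarity, concept) for its closest concept so far
    for concept, mean in means.items():
        ranked = sorted(
            ((cosine(vec, mean), tok) for tok, vec in vectors.items()
             if not tok.startswith("[NER:")),  # drop named-entity placeholders
            reverse=True,
        )
        for sim, tok in ranked[:top_n]:
            # Cross-loading: keep the token only under its most similar concept.
            if tok not in best or sim > best[tok][0]:
                best[tok] = (sim, concept)
    result = {c: [] for c in seeds}
    for tok, (sim, concept) in best.items():
        result[concept].append((sim, tok))
    return {c: [tok for _, tok in sorted(pairs, reverse=True)]
            for c, pairs in result.items()}


vectors = {
    "risk": [1.0, 0.0], "uncertainty": [0.9, 0.1], "exposure": [0.8, 0.3],
    "growth": [0.0, 1.0], "expansion": [0.1, 0.9], "scaling": [0.3, 0.8],
    "demand": [0.6, 0.7], "[NER:ORG]": [1.0, 0.1],
}
seeds = {"risk": ["risk", "uncertainty", "hedging"],  # "hedging" is out of vocabulary
         "growth": ["growth", "expansion"]}

expanded = expand(vectors, seeds, top_n=5)
print(expanded)
```

In this toy run "demand" and "scaling" appear in both candidate lists but end up only under growth, the concept whose seed mean they are closer to.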
p.show_dictionary(top_k=10) # prints per-concept seeds + top expansions
p.dictionary_preview(top_k=10) # DataFrame for notebook display
Step 4: Manual dictionary inspection
Nearest-neighbor expansion surfaces noise: off-topic terms, industry-specific outliers, words too general to be informative. Two ways to remove them, both atomic across the in-memory dictionary and the on-disk CSV:
# Programmatic, replicable in a notebook:
p.edit_dictionary(
remove={"risk": ["fantastic", "build"]},
add={"risk": ["liability"]},
)
# Spreadsheet-driven, faster on a big dictionary:
# 1. open p.dict_path in Excel or any text editor
# 2. edit, save
# 3. p.reload_dictionary()
Cached scores are dropped after curation. Call p.score() to rescore against the curated dictionary.
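As an illustration of what a curation step that keeps memory and disk in sync can involve, the sketch below applies removals and additions to a copy of the dictionary, writes the one-column-per-concept CSV to a temporary file, and swaps it into place with os.replace. The function `edit_and_save` is a hypothetical stand-in, not the package's edit_dictionary, and the exact CSV layout is assumed from the description in Step 3.

```python
import csv
import os
import tempfile
from itertools import zip_longest


def edit_and_save(dictionary, csv_path, remove=None, add=None):
    """Apply edits to a copy, write the CSV via a temp file, and return the new dict."""
    updated = {c: list(ws) for c, ws in dictionary.items()}
    for concept, words in (remove or {}).items():
        drop = set(words)
        updated[concept] = [w for w in updated.get(concept, []) if w not in drop]
    for concept, words in (add or {}).items():
        column = updated.setdefault(concept, [])
        for w in words:
            if w not in column:
                column.append(w)
    concepts = list(updated)
    directory = os.path.dirname(os.path.abspath(csv_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(concepts)
        # Columns have different lengths, so pad the shorter ones with blanks.
        writer.writerows(zip_longest(*(updated[c] for c in concepts), fillvalue=""))
    os.replace(tmp_path, csv_path)  # the old file is swapped out in one step
    return updated


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "expanded_dict.csv")
    curated = edit_and_save(
        {"risk": ["risk", "fantastic", "exposure", "build"],
         "growth": ["growth", "scaling"]},
        path,
        remove={"risk": ["fantastic", "build"]},
        add={"risk": ["liability"]},
    )
    with open(path, newline="", encoding="utf-8") as fh:
        rows = list(csv.reader(fh))

print(curated["risk"])  # ['risk', 'exposure', 'liability']
```

Writing to a temporary file first means a crash mid-write leaves the previous CSV intact rather than a half-written one.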
Scoring
A document's score on a concept is the sum of the weights of every dictionary token present in the document, divided by total document length. The weight depends on the scoring method; the 2021 paper uses TF-IDF.
| Method | Weight per dictionary hit | Source |
|---|---|---|
| TFIDF | tf · log(N/df) | 2021 paper |
| TF | tf | extension |
| WFIDF | (1 + log tf) · log(N/df) | extension |
| TFIDF+SIMWEIGHT, WFIDF+SIMWEIGHT | × 1/ln(2 + rank) | extension |
SIMWEIGHT variants additionally down-weight tokens further from the seed mean (rank in the expanded dictionary).
p.score(methods=("TFIDF",))
p.score_df("TFIDF")
Outputs land at work_dir/outputs/scores_<METHOD>.csv.
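The weighting schemes can be worked through by hand on a toy corpus. The sketch below is a plain-Python illustration, not the package's scorer: `score_documents` and `WEIGHTS` are hypothetical names, natural logarithms are assumed, and the SIMWEIGHT variants are left out because the rank convention (zero- or one-based) is not stated here.

```python
import math
from collections import Counter

# Weight of one dictionary token given its term frequency and inverse document frequency.
WEIGHTS = {
    "TF": lambda tf, idf: tf,
    "TFIDF": lambda tf, idf: tf * idf,
    "WFIDF": lambda tf, idf: (1 + math.log(tf)) * idf,
}


def score_documents(docs, dictionary, method="TFIDF"):
    """Return one {concept: score} row per tokenized document."""
    weight = WEIGHTS[method]
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = {}
        for concept, words in dictionary.items():
            total = sum(
                weight(tf[w], math.log(n_docs / df[w]))
                for w in words if tf[w] > 0
            )
            # Normalize by document length so long documents are not favored.
            row[concept] = total / len(doc) if doc else 0.0
        rows.append(row)
    return rows


docs = [
    ["risk", "risk", "growth", "margin"],
    ["growth", "demand", "margin", "margin"],
    ["hedge", "risk", "exposure", "cost"],
]
dictionary = {"risk": ["risk", "exposure"], "growth": ["growth", "demand"]}

scores = score_documents(docs, dictionary, method="TFIDF")
tf_scores = score_documents(docs, dictionary, method="TF")
for doc_id, row in enumerate(scores):
    print(doc_id, {c: round(s, 3) for c, s in row.items()})
```

In the first document "risk" appears twice and occurs in two of the three documents, so its TFIDF contribution is 2 · ln(3/2), which is then divided by the document length of 4. A token that appears in every document gets a weight of zero under TFIDF and WFIDF.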
Loading documents and seeds
Pipeline(texts=[...], doc_ids=[...], work_dir=..., config=cfg) # in-memory list
Pipeline.from_csv("docs.csv", text_col="text", id_col="id", ...) # CSV
Pipeline.from_dataframe(df, text_col="text", id_col="id", ...) # DataFrame
Pipeline.from_directory("./docs/", pattern="*.txt", ...) # one file per doc
Pipeline.from_text_file("docs.txt", id_path="ids.txt", ...) # one doc per line
Pipeline.from_jsonl("docs.jsonl", text_key="text", id_key="id", ...) # JSONL
Seeds accept a Python dict, a JSON file, or a plain text file:
from lmsy_w2v_rfs import load_seeds
Config(seeds=load_seeds("my_seeds.json")) # or .txt, or pass a dict directly
CLI:
lmsy-w2v-rfs run --seeds my_seeds.txt --input docs.csv --input-format csv --out runs/x
Large corpora
Once parsing finishes, downstream stages stream through disk: clean reads parsed sentences line by line; phrase and train use gensim's PathLineSentences so the training corpus is never fully materialized. The bottleneck is the input stage: the document loader holds the corpus in a Python list before parsing begins.
A scaling-friendly input format is one document per line in a single file, with an optional parallel IDs file.
from lmsy_w2v_rfs import Pipeline, Config, load_seeds
# transcripts.txt: one document per line. transcript_ids.txt: matching IDs.
p = Pipeline.from_text_file(
"/data/transcripts.txt",
id_path="/data/transcript_ids.txt",
work_dir="runs/big",
config=Config(
seeds=load_seeds("my_seeds.txt"),
preprocessor="spacy", # or "corenlp" for paper-faithful
n_cores=8,
),
)
p.run()
If id_path is omitted, IDs are generated as line numbers ("0", "1", ...).
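The one-document-per-line layout is easy to produce and to check outside the package. The sketch below reads such a file lazily, pairs it with an optional IDs file, and falls back to line numbers as IDs. The generator `iter_documents` is a hypothetical helper for inspecting inputs; it is not the package's loader, which (as noted above) holds the corpus in a list.

```python
import os
import tempfile
from itertools import zip_longest

_MISSING = object()


def iter_documents(text_path, id_path=None):
    """Yield (doc_id, text) pairs, reading one document per line."""
    with open(text_path, encoding="utf-8") as texts:
        if id_path is None:
            # No IDs file: use zero-based line numbers as string IDs.
            for i, line in enumerate(texts):
                yield str(i), line.rstrip("\n")
            return
        with open(id_path, encoding="utf-8") as ids:
            for doc_id, line in zip_longest(ids, texts, fillvalue=_MISSING):
                if doc_id is _MISSING or line is _MISSING:
                    raise ValueError("text file and ID file have different line counts")
                yield doc_id.rstrip("\n"), line.rstrip("\n")


with tempfile.TemporaryDirectory() as tmpdir:
    text_path = os.path.join(tmpdir, "transcripts.txt")
    id_path = os.path.join(tmpdir, "transcript_ids.txt")
    with open(text_path, "w", encoding="utf-8") as fh:
        fh.write("first call\nsecond call\nthird call\n")
    with open(id_path, "w", encoding="utf-8") as fh:
        fh.write("call_a\ncall_b\ncall_c\n")

    default_ids = [doc_id for doc_id, _ in iter_documents(text_path)]
    paired = list(iter_documents(text_path, id_path))

print(default_ids)  # ['0', '1', '2']
print(paired[0])    # ('call_a', 'first call')
```

Because each document must fit on one line, any newlines inside a document need to be replaced (for example with spaces) before writing the file.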
For corpora that exceed RAM (tens of millions of long documents), shard the work manually:
- Pre-split transcripts.txt into shards (split -l 100000 transcripts.txt shard_).
- Run parse separately per shard.
- Concatenate each shard's parsed/sentences.txt and parsed/sentence_ids.txt into a merged work_dir.
- Run clean, phrase, train, expand_dictionary, and score once on the merged corpus.
The stage methods read from disk and are idempotent, so this kind of manual orchestration works without subclassing.
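Where the split utility is not available, the sharding and merging around the per-shard parse can be done in Python. This is a sketch with hypothetical helpers (`split_into_shards`, `concatenate`); it covers only the file handling, uses numeric shard suffixes rather than split's alphabetic ones, and leaves the per-shard parse itself to the pipeline.

```python
import os
import shutil
import tempfile


def split_into_shards(path, lines_per_shard, out_dir, prefix="shard_"):
    """Write consecutive blocks of lines to numbered shard files; return their paths."""
    shard_paths, out = [], None
    with open(path, encoding="utf-8", newline="") as src:
        for n, line in enumerate(src):
            if n % lines_per_shard == 0:
                if out is not None:
                    out.close()
                shard_path = os.path.join(out_dir, f"{prefix}{len(shard_paths):04d}")
                shard_paths.append(shard_path)
                out = open(shard_path, "w", encoding="utf-8", newline="")
            out.write(line)
    if out is not None:
        out.close()
    return shard_paths


def concatenate(paths, dest):
    """Append files in the given order, streaming so nothing is held in memory."""
    with open(dest, "wb") as out:
        for p in paths:
            with open(p, "rb") as src:
                shutil.copyfileobj(src, out)


with tempfile.TemporaryDirectory() as tmpdir:
    source = os.path.join(tmpdir, "transcripts.txt")
    original = "".join(f"doc {i}\n" for i in range(5))
    with open(source, "w", encoding="utf-8", newline="") as fh:
        fh.write(original)

    shards = split_into_shards(source, lines_per_shard=2, out_dir=tmpdir)
    merged_path = os.path.join(tmpdir, "merged.txt")
    concatenate(shards, merged_path)
    with open(merged_path, encoding="utf-8", newline="") as fh:
        merged = fh.read()

print(len(shards), merged == original)  # 3 True
```

The same `concatenate` call applies to the per-shard sentence and ID files, as long as both are merged in the same shard order so the lines stay aligned.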
For smaller corpora (a few thousand to a hundred thousand documents), Pipeline.from_directory(...), Pipeline.from_csv(...), and Pipeline.from_dataframe(...) are convenient. They all materialize the corpus in memory at construction.
All knobs
Config(
seeds=..., # required: dict[str, list[str]]
# Step 1a
preprocessor="corenlp", # "corenlp" | "spacy" | "stanza" | "static" | "none"
mwe_list=None, # None | "finance" | path to a curated list
spacy_model="en_core_web_sm",
n_cores=4,
corenlp_memory="6G",
corenlp_port=9002,
# Step 1b
use_gensim_phrases=True,
phrase_passes=2,
phrase_threshold=10.0,
phrase_min_count=10,
# Step 2
w2v_dim=300,
w2v_window=5,
w2v_min_count=5,
w2v_epochs=20,
# Step 3
n_words_dim=500, # paper's threshold for the dictionary cutoff
dict_restrict_vocab=None,
min_similarity=0.0,
# Scoring (extensions beyond the 2021 paper)
tfidf_normalize=False,
zca_whiten=False, # ZCA-decorrelate the concept columns
zca_epsilon=1e-6,
random_state=42,
)
Citation
@article{li2021measuring,
title={Measuring Corporate Culture Using Machine Learning},
author={Li, Kai and Mai, Feng and Shen, Rui and Yan, Xinyan},
journal={The Review of Financial Studies},
volume={34}, number={7}, pages={3265--3315}, year={2021},
doi={10.1093/rfs/hhaa079}
}
License
MIT.