Word-embedding seed expansion and document scoring. Bring your own seeds. Originally Li, Mai, Shen, Yan (2021, RFS).

Seed words expansion and measurements using Word2Vec (Li, Mai, Shen, Yan 2021 RFS)

Builds a corpus-specific measurement dictionary with Word2Vec. For each concept you want to measure in your corpus:

You provide a short seed-word list per concept.
The package builds a ranked dictionary of the words and multi-word phrases your corpus uses to express that concept.
You curate the dictionary: inspect, drop noise, add domain words.
The package scores every document by weighted hits against the curated dictionary.

If you find this library useful in your research, please cite:

Li, Kai, Feng Mai, Rui Shen, and Xinyan Yan (2021), "Measuring Corporate Culture Using Machine Learning," Review of Financial Studies 34(7):3265-3315, doi.org/10.1093/rfs/hhaa079.

Install

pip install -U lmsy_w2v_rfs

The default preprocessor (corenlp) needs Java and a one-time CoreNLP archive download:

pip install -U "lmsy_w2v_rfs[corenlp]"
lmsy-w2v-rfs download-corenlp                 # one-time, ~1 GB

Java-free alternatives:

pip install -U "lmsy_w2v_rfs[spacy]" && python -m spacy download en_core_web_sm
pip install -U "lmsy_w2v_rfs[stanza]"
pip install -U lmsy_w2v_rfs                   # bare; use preprocessor="static" or "none"

Quickstart

Two concepts, a few seed words each, four lines of pipeline:

from lmsy_w2v_rfs import Pipeline, Config

seeds = {
    "risk":   ["risk", "uncertainty", "volatility", "downside"],
    "growth": ["growth", "expansion", "scale", "opportunity"],
}
texts = [
    "Macro uncertainty and rising rates weighed on margins this quarter.",
    "Strong customer demand drove double-digit revenue expansion across segments.",
    "We hedged commodity exposure to limit downside from price volatility.",
    "Investments in new markets are scaling our growth opportunity.",
    # ... thousands more rows in practice
]

p = Pipeline(
    texts=texts, doc_ids=[f"d{i}" for i in range(len(texts))],
    work_dir="runs/quickstart",
    config=Config(seeds=seeds, preprocessor="none"),
)
p.run()                     # phrase + train + expand + score
p.show_dictionary(top_k=10) # inspect the expanded dictionary
print(p.score_df("TFIDF"))  # per-document scores

=== risk (12 words) ===
  seeds:    risk, uncertainty, volatility, downside
  expanded: risk, uncertainty, volatility, downside, exposure,
            commodity_exposure, rising_rates, hedge, macro_uncertainty
=== growth (14 words) ===
  seeds:    growth, expansion, scale, opportunity
  expanded: growth, expansion, scale, opportunity, customer_demand,
            new_markets, revenue_expansion, double_digit, scaling

Doc_ID	risk	growth	document_length
d0	0.41	0.00	13
d1	0.00	0.55	12
d2	0.62	0.00	12
d3	0.00	0.49	11

To reproduce the 2021 paper exactly:

from lmsy_w2v_rfs import load_example_seeds
seeds = load_example_seeds("culture_2021")    # 47 seeds, 5 dimensions

The construction procedure

The package implements the four-step construction procedure of Li et al. (2021). Each step is a method on Pipeline; calling .run() executes them in order and saves intermediate artifacts under work_dir/ so any step can be redone without redoing the others.

Step 1: Two-step phrase construction

Phrases carry meaning that single words cannot. The package extracts them in two complementary steps targeting different kinds of phrases.

Step 1a, parser-based (general-English phrases). A dependency parser identifies fixed multiword expressions (with_respect_to, rather_than) and compound words (intellectual_property, healthcare_provider). The parser also lemmatizes (stocks → stock) and masks named entities with placeholders such as [NER:ORG] so proper nouns do not bias the vector space. Tokens on the 120-token SRAF generic stopword list are removed in the cleaning pass that follows.

`Config(preprocessor=...)`	Backend	Needs
`"corenlp"` (default, paper-faithful)	Stanford CoreNLP via `stanza.server`	`[corenlp]` extra + Java
`"spacy"`	spaCy	`[spacy]` extra + a model
`"stanza"`	stanza `Pipeline`	`[stanza]` extra
`"static"`	NLTK `MWETokenizer` over a curated list	base install
`"none"`	whitespace tokenize, lowercase only	base install

Step 1b, statistical (corpus-specific phrases). After Step 1a, gensim's Phrases scans the parsed corpus for statistically significant adjacent-token co-occurrences and joins them with _. A second pass over the bigram-joined corpus learns trigrams. This step identifies recurring collocations specific to the corpus: an earnings-call corpus surfaces forward_looking_statement and cost_of_capital; a Glassdoor-review corpus surfaces work_life_balance and toxic_environment.

Config(
    use_gensim_phrases=True,
    phrase_passes=2,            # 1 = bigrams; 2 = bigrams + trigrams
    phrase_min_count=10,        # works on a ~270k-doc corpus; for smaller corpora try 3
    phrase_threshold=10.0,      # works on a ~270k-doc corpus; for smaller corpora try 5.0
)
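To build intuition for phrase_min_count and phrase_threshold, the sketch below re-implements in plain Python the default bigram score documented for gensim's Phrases, (count(ab) - min_count) * V / (count(a) * count(b)), and applies it twice to a toy corpus. The function names, the toy sentences, and the way the vocabulary size V is counted are illustrative assumptions; the package itself delegates this step to gensim, whose absolute scores may differ.

```python
from collections import Counter

def bigram_scores(sentences, min_count):
    """Score adjacent pairs: (count(ab) - min_count) * V / (count(a) * count(b))."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    # Assumption: V counts distinct unigrams plus distinct bigrams.
    vocab_size = len(unigrams) + len(bigrams)
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

def join_phrases(sentence, scores, threshold):
    """Greedily join adjacent tokens whose score exceeds the threshold."""
    out, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, float("-inf")) > threshold:
            out.append("_".join(pair))
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

toy = [
    "cost of capital rose".split(),
    "cost of capital fell".split(),
    "the cost of capital matters".split(),
    "capital markets rose".split(),
]

# Pass 1 learns bigrams; pass 2 runs on the joined corpus and can learn trigrams.
scores = bigram_scores(toy, min_count=1)
joined = [join_phrases(s, scores, threshold=2.0) for s in toy]
second = bigram_scores(joined, min_count=1)
twice_joined = [join_phrases(s, second, threshold=2.0) for s in joined]
print(twice_joined[0])
```

Raising the threshold or min_count in this sketch makes fewer pairs qualify, which is why smaller corpora usually need lower values for both.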

The phrase-tagged corpus is written to work_dir/corpora/pass2.txt and can be opened directly to inspect the joined phrases.

Step 2: Word2Vec

Pipeline.train() fits a gensim.models.Word2Vec on the phrase-tagged corpus. Every word and phrase receives a 300-dimensional vector. Defaults match the 2021 paper:

Config(w2v_dim=300, w2v_window=5, w2v_min_count=5, w2v_epochs=20)

The model is saved at work_dir/models/w2v.mod and is available as p.w2v for ad-hoc queries.

Step 3: Seed expansion

Pipeline.expand_dictionary() builds the per-concept dictionary by:

Averaging the in-vocabulary seed vectors for the concept.
Taking the top n_words_dim (default 500) tokens by cosine similarity to that mean.
Resolving cross-loadings: a token close to multiple concepts is assigned to the one whose seed mean it is closest to.
Dropping [NER:*] placeholders so named entities never enter the dictionary.

The result is written to work_dir/outputs/expanded_dict.csv, one column per concept, sorted by descending similarity to the seed mean.

p.show_dictionary(top_k=10)         # prints per-concept seeds + top expansions
p.dictionary_preview(top_k=10)      # DataFrame for notebook display
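The expansion steps listed above can be sketched in plain Python on toy two-dimensional vectors. This is an illustration under stated assumptions, not the package's implementation: the expand function, the toy vectors, and the order in which cross-loadings and placeholders are handled are all hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand(vectors, seeds, top_n):
    """vectors: token -> list[float]; seeds: concept -> list of seed tokens."""
    # 1. Average the in-vocabulary seed vectors of each concept.
    means = {}
    for concept, words in seeds.items():
        known = [vectors[w] for w in words if w in vectors]
        if known:
            means[concept] = [sum(col) / len(known) for col in zip(*known)]
    # 2. Keep the top_n tokens by cosine similarity to each seed mean.
    candidates = {
        concept: sorted(
            ((cosine(vec, mean), tok) for tok, vec in vectors.items()),
            reverse=True,
        )[:top_n]
        for concept, mean in means.items()
    }
    # 3. Resolve cross-loadings: a token stays with its closest concept only.
    best = {}
    for concept, ranked in candidates.items():
        for sim, tok in ranked:
            if tok not in best or sim > best[tok][0]:
                best[tok] = (sim, concept)
    # 4. Drop named-entity placeholders, then sort by descending similarity.
    result = {concept: [] for concept in means}
    for tok, (sim, concept) in best.items():
        if not tok.startswith("[NER:"):
            result[concept].append((sim, tok))
    return {c: [tok for _, tok in sorted(pairs, reverse=True)]
            for c, pairs in result.items()}

vectors = {
    "risk": [1.0, 0.0], "uncertainty": [0.9, 0.1], "hedge": [0.8, 0.3],
    "growth": [0.0, 1.0], "expansion": [0.1, 0.9], "new_markets": [0.3, 0.8],
    "exposure": [0.6, 0.5], "[NER:ORG]": [1.0, 0.05],
}
seeds = {"risk": ["risk", "uncertainty", "downside"],  # "downside" is out of vocabulary
         "growth": ["growth", "expansion"]}
result = expand(vectors, seeds, top_n=5)
print(result)
```

In the toy data, exposure is a candidate for both concepts and ends up under risk because it is closer to that seed mean.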

Step 4: Manual dictionary inspection

Nearest-neighbor expansion surfaces noise: off-topic terms, industry-specific outliers, words too general to be informative. Two ways to remove them, both atomic across the in-memory dictionary and the on-disk CSV:

# Programmatic, replicable in a notebook:
p.edit_dictionary(
    remove={"risk": ["fantastic", "build"]},
    add={"risk": ["liability"]},
)

# Spreadsheet-driven, faster on a big dictionary:
#   1. open p.dict_path in Excel or any text editor
#   2. edit, save
#   3. p.reload_dictionary()

Cached scores are dropped after curation. Call p.score() to rescore against the curated dictionary.

Scoring

A document's score on a concept is the sum of the per-hit weights of every dictionary token present in the document, divided by total document length. The paper uses TF-IDF weights; the other methods are extensions.

Method	Weight per dictionary hit	Source
`TFIDF`	`tf · log(N/df)`	2021 paper
`TF`	`tf`	extension
`WFIDF`	`(1 + log tf) · log(N/df)`	extension
`TFIDF+SIMWEIGHT`, `WFIDF+SIMWEIGHT`	× `1/ln(2 + rank)`	extension

SIMWEIGHT variants additionally down-weight tokens that lie further from the seed mean, using each token's rank in the expanded dictionary.
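As a worked illustration of the weights in the table, the sketch below scores three toy documents in plain Python. It is not the package's code: score_documents is a hypothetical helper, and the natural logarithm and the zero-based rank (best match has rank 0) are assumptions.

```python
import math
from collections import Counter

def score_documents(docs, dictionary, method="TFIDF"):
    """docs: list of token lists; dictionary: concept -> ranked token list."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    base = method.split("+")[0]
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = {}
        for concept, words in dictionary.items():
            total = 0.0
            for rank, word in enumerate(words):  # assumption: rank starts at 0
                if tf[word] == 0:
                    continue
                idf = math.log(n_docs / df[word])  # assumption: natural log
                if base == "TFIDF":
                    weight = tf[word] * idf
                elif base == "WFIDF":
                    weight = (1 + math.log(tf[word])) * idf
                else:  # "TF"
                    weight = tf[word]
                if method.endswith("+SIMWEIGHT"):
                    weight /= math.log(2 + rank)
                total += weight
            # Normalize by document length so long documents are not favored.
            row[concept] = total / len(doc) if doc else 0.0
        rows.append(row)
    return rows

docs = [
    ["risk", "risk", "margin", "growth"],
    ["growth", "demand"],
    ["margin", "demand", "cost", "cost"],
]
dictionary = {"risk": ["risk"], "growth": ["growth"]}
tfidf = score_documents(docs, dictionary, "TFIDF")
wfidf_sim = score_documents(docs, dictionary, "WFIDF+SIMWEIGHT")
print(tfidf)
```

For the first document, risk occurs twice and appears in one of three documents, so its TFIDF score is 2 · ln(3) / 4.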

p.score(methods=("TFIDF",))
p.score_df("TFIDF")

Outputs land at work_dir/outputs/scores_<METHOD>.csv.

Loading documents and seeds

Pipeline(texts=[...], doc_ids=[...], work_dir=..., config=cfg)              # in-memory list
Pipeline.from_csv("docs.csv", text_col="text", id_col="id", ...)            # CSV
Pipeline.from_dataframe(df, text_col="text", id_col="id", ...)              # DataFrame
Pipeline.from_directory("./docs/", pattern="*.txt", ...)                    # one file per doc
Pipeline.from_text_file("docs.txt", id_path="ids.txt", ...)                 # one doc per line
Pipeline.from_jsonl("docs.jsonl", text_key="text", id_key="id", ...)        # JSONL

Seeds accept a Python dict, a JSON file, or a plain text file:

from lmsy_w2v_rfs import load_seeds
Config(seeds=load_seeds("my_seeds.json"))     # or .txt, or pass a dict directly
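Because a seed only contributes when it matches a vocabulary token, it can help to normalize seeds before passing them in. The sketch below is a hypothetical helper, not part of the package: it lowercases seeds, joins multi-word seeds with _ to mirror the phrase tokens, and serializes the result as a JSON object mapping each concept to a list of words. That JSON layout mirrors the dict[str, list[str]] type and is an assumption about what load_seeds accepts.

```python
import json

def normalize_seeds(raw):
    """Lowercase, join multi-word seeds with '_', drop blanks and duplicates."""
    cleaned = {}
    for concept, words in raw.items():
        seen = []
        for word in words:
            token = "_".join(word.lower().split())
            if token and token not in seen:
                seen.append(token)  # keep first occurrence, preserve order
        if not seen:
            raise ValueError(f"concept {concept!r} has no usable seed words")
        cleaned[concept] = seen
    return cleaned

raw = {
    "risk": ["Risk", "uncertainty", "risk ", "price volatility"],
    "growth": ["growth", "Expansion"],
}
seeds = normalize_seeds(raw)
seeds_json = json.dumps(seeds, indent=2)  # text that could be saved as my_seeds.json
print(seeds_json)
```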

CLI:

lmsy-w2v-rfs run --seeds my_seeds.txt --input docs.csv --input-format csv --out runs/x

Large corpora

Once parsing finishes, downstream stages stream through disk: clean reads parsed sentences line by line; phrase and train use gensim's PathLineSentences so the training corpus is never fully materialized. The bottleneck is the input stage: the document loader holds the corpus in a Python list before parsing begins.

A scaling-friendly input format is one document per line in a single file, with an optional parallel IDs file.

from lmsy_w2v_rfs import Pipeline, Config, load_seeds

# transcripts.txt: one document per line. transcript_ids.txt: matching IDs.
p = Pipeline.from_text_file(
    "/data/transcripts.txt",
    id_path="/data/transcript_ids.txt",
    work_dir="runs/big",
    config=Config(
        seeds=load_seeds("my_seeds.txt"),
        preprocessor="spacy",                   # or "corenlp" for paper-faithful
        n_cores=8,
    ),
)
p.run()

If id_path is omitted, IDs are generated as line numbers ("0", "1", ...).
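The one-document-per-line format only works if no document contains a line break of its own. A minimal sketch for producing the two parallel files from (id, text) records follows; write_line_corpus is a hypothetical helper, and skipping empty documents is a design choice of this sketch rather than documented package behavior.

```python
import tempfile
from pathlib import Path

def write_line_corpus(records, text_path, id_path):
    """Write (doc_id, text) records as one document per line plus a parallel IDs file."""
    written = 0
    with open(text_path, "w", encoding="utf-8") as ftext, \
         open(id_path, "w", encoding="utf-8") as fid:
        for doc_id, text in records:
            flat = " ".join(text.split())  # collapse newlines, tabs, repeated spaces
            if not flat:
                continue  # assumption: empty documents are skipped
            ftext.write(flat + "\n")
            fid.write(f"{doc_id}\n")
            written += 1
    return written

records = [
    ("call_001", "Revenue grew.\nMargins were stable."),
    ("call_002", "   "),
    ("call_003", "We hedged\tcommodity exposure."),
]
with tempfile.TemporaryDirectory() as tmp:
    text_path = Path(tmp) / "transcripts.txt"
    id_path = Path(tmp) / "transcript_ids.txt"
    n_written = write_line_corpus(records, text_path, id_path)
    lines = text_path.read_text(encoding="utf-8").splitlines()
    ids = id_path.read_text(encoding="utf-8").splitlines()
print(n_written, ids)
```

Writing both files in the same loop keeps line i of the IDs file aligned with line i of the text file.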

For corpora that exceed RAM (tens of millions of long documents), shard the work manually. Pre-split transcripts.txt into shards (split -l 100000 transcripts.txt shard_), split the parallel IDs file with the same line count, and run parse separately on each shard. Then concatenate each shard's parsed/sentences.txt and parsed/sentence_ids.txt into a merged work_dir, and run clean, phrase, train, expand_dictionary, and score once on the merged corpus. The stage methods read from disk and are idempotent, so this kind of manual orchestration works without subclassing.
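The file handling around that workflow can be sketched with the standard library alone. The two helpers below are hypothetical and cover only the splitting and concatenation; the per-shard parse and the later stages still go through the package. The shard file names are illustrative.

```python
import tempfile
from pathlib import Path

def split_into_shards(text_path, id_path, out_dir, docs_per_shard):
    """Stream two parallel line files into numbered shard pairs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shards, ftext_out, fid_out = [], None, None
    try:
        with open(text_path, encoding="utf-8") as ftext, \
             open(id_path, encoding="utf-8") as fid:
            for i, (line, doc_id) in enumerate(zip(ftext, fid)):
                if i % docs_per_shard == 0:  # start a new shard pair
                    if ftext_out:
                        ftext_out.close()
                        fid_out.close()
                    k = i // docs_per_shard
                    pair = (out_dir / f"shard_{k:05d}.txt",
                            out_dir / f"shard_{k:05d}_ids.txt")
                    ftext_out = open(pair[0], "w", encoding="utf-8")
                    fid_out = open(pair[1], "w", encoding="utf-8")
                    shards.append(pair)
                ftext_out.write(line)
                fid_out.write(doc_id)
    finally:
        if ftext_out:
            ftext_out.close()
            fid_out.close()
    return shards

def concatenate(paths, dest):
    """Append files line by line, guaranteeing a newline between inputs."""
    with open(dest, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    out.write(line if line.endswith("\n") else line + "\n")

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "docs.txt").write_text("a\nb\nc\nd\ne\n", encoding="utf-8")
    (tmp / "ids.txt").write_text("0\n1\n2\n3\n4\n", encoding="utf-8")
    shards = split_into_shards(tmp / "docs.txt", tmp / "ids.txt", tmp / "shards", 2)
    shard_sizes = [len(t.read_text(encoding="utf-8").splitlines()) for t, _ in shards]
    concatenate([t for t, _ in shards], tmp / "merged.txt")
    merged = (tmp / "merged.txt").read_text(encoding="utf-8")
print(shard_sizes)
```

Both helpers read one line at a time, so memory use stays flat regardless of corpus size.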

For smaller corpora (a few thousand to a hundred thousand documents), Pipeline.from_directory(...), Pipeline.from_csv(...), and Pipeline.from_dataframe(...) are convenient. They all materialize the corpus in memory at construction.

All knobs

Config(
    seeds=...,                         # required: dict[str, list[str]]

    # Step 1a
    preprocessor="corenlp",            # "corenlp" | "spacy" | "stanza" | "static" | "none"
    mwe_list=None,                     # None | "finance" | path to a curated list
    spacy_model="en_core_web_sm",
    n_cores=4,
    corenlp_memory="6G",
    corenlp_port=9002,

    # Step 1b
    use_gensim_phrases=True,
    phrase_passes=2,
    phrase_threshold=10.0,
    phrase_min_count=10,

    # Step 2
    w2v_dim=300,
    w2v_window=5,
    w2v_min_count=5,
    w2v_epochs=20,

    # Step 3
    n_words_dim=500,                   # paper's threshold for the dictionary cutoff
    dict_restrict_vocab=None,
    min_similarity=0.0,

    # Scoring (extensions beyond the 2021 paper)
    tfidf_normalize=False,
    zca_whiten=False,                  # ZCA-decorrelate the concept columns
    zca_epsilon=1e-6,

    random_state=42,
)
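For readers unfamiliar with the zca_whiten and zca_epsilon knobs, the sketch below shows a generic ZCA whitening of a score matrix with NumPy. It illustrates the standard technique only: whether the package centers the columns, and exactly where it applies the epsilon, are assumptions, and zca_whiten here is a hypothetical standalone function.

```python
import numpy as np

def zca_whiten(scores, epsilon=1e-6):
    """Decorrelate the columns of an (n_docs, n_concepts) score matrix."""
    X = np.asarray(scores, dtype=float)
    Xc = X - X.mean(axis=0)                      # assumption: columns are centered
    cov = Xc.T @ Xc / (len(Xc) - 1)              # sample covariance of the concepts
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric eigendecomposition
    # W = U diag(1 / sqrt(lambda + eps)) U^T; eps guards against tiny eigenvalues.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ eigvecs.T
    return Xc @ W

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
correlated = base @ np.array([[1.0, 0.8], [0.0, 0.6]])  # two correlated "concepts"
whitened = zca_whiten(correlated)
print(np.round(np.cov(whitened, rowvar=False), 3))
```

Unlike PCA whitening, the ZCA transform rotates back into the original axes, so each whitened column still corresponds to one concept.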

Citation

@article{li2021measuring,
  title={Measuring Corporate Culture Using Machine Learning},
  author={Li, Kai and Mai, Feng and Shen, Rui and Yan, Xinyan},
  journal={The Review of Financial Studies},
  volume={34}, number={7}, pages={3265--3315}, year={2021},
  doi={10.1093/rfs/hhaa079}
}

License

MIT.
