Customize phrasebanks from various texts or corpora.
Open Phrasebank
Building your own phrasebank. ✨
This repository provides an accessible phrase bank: a collection of frequently used phrases that can be used, for example, by the auto-complete function of an IDE. (Note: this library does not provide an IDE or auto-complete functions; it offers ready-to-use phrasebanks.)
Moreover, this repository includes features for constructing a phrase bank from a provided text or an open corpus.
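As a rough illustration of how an auto-complete feature could consume a phrasebank, the sketch below does a prefix lookup over a sorted list of phrases. It is not part of this library: the `complete` function and the sample phrases are hypothetical, and it assumes the phrasebank has been read into a list with one phrase per entry.

```python
from bisect import bisect_left


def complete(prefix, sorted_phrases, limit=5):
    """Return up to `limit` phrases that start with `prefix`.

    `sorted_phrases` must be sorted, so that all phrases sharing a
    prefix sit next to each other.
    """
    start = bisect_left(sorted_phrases, prefix)
    matches = []
    for phrase in sorted_phrases[start:]:
        if not phrase.startswith(prefix):
            break  # left the block of phrases sharing the prefix
        matches.append(phrase)
        if len(matches) == limit:
            break
    return matches


# Hypothetical sample standing in for a downloaded phrasebank file.
phrases = sorted([
    "as a result of",
    "in the context of",
    "in the present study",
    "on the other hand",
    "the results of the",
])

print(complete("in the", phrases))
# ['in the context of', 'in the present study']
```

Keeping the list sorted lets the lookup jump straight to the first candidate with a binary search instead of scanning every phrase.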
Why Use Phrase Bank
Boosting Typing Experience with Phrasebank 🚀
Academic Writing 🕵️♀
You can further customize the phrasebank to your needs, e.g. for certain disciplines, for certain styles (descriptive, analytical, persuasive and critical), or for certain sections (abstract, body text), as long as you can find good source material.
Open Phrasebanks
Academic Phrasebank
Elsevier OA CC-BY contains 40k articles from Elsevier's journals, covering disciplines from Arts and Business to STEM and Social Sciences[^1].
No. | Phrasebank | Source | N-gram Length | Lines | Comments |
---|---|---|---|---|---|
1 | 📍academic_phrasebank | Book Academic Phrasebank 2014 | 2-5 | 2,190 | Extracted from PDF (Zhihao, 2024) |
2 | 📍elsevier_phrasebank | Corpus Elsevier OA CC-BY 2020 | 2-6 | 3,792 | Extracted by n-gram (Zhihao, 2024) |
3 | 📍bawe_1000.csv | Corpus British Academic Written English | 4-6 | 1,000 | Because the corpus is not freely accessible, only the 1,000 most frequent phrases are listed here (Zhihao, 2024) |
4 | 📍academic_word_list | Academic Word List Coxhead (2000) | 1 | 570 | The 570 words for academic English (excluding the 2,000 most frequent words) |
5 | 📍elsevier_awl | 2, 4 | 2-6 | 994 | The Elsevier phrasebank entries that contain AWL words (Zhihao, 2024) |
6 | 📍elsevier_ENVI_EART | 2 | 2-7 | 3,700 | Environment & Earth Science 3700 collection (Zhihao, 2024) |
7 | 📍elsevier_PSYC_SOCI | 2 | 2-7 | 3,700 | Social Science & Psychology 3700 collection (Zhihao, 2024) |
8 | 📍elsevier_MEDI | 2 | 2-7 | 3,700 | Medicine 3700 collection (Zhihao, 2024) |
[^1]: Over 20 disciplines; see orieg/elsevier-oa-cc-by (Datasets at Hugging Face).
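A phrasebank such as elsevier_awl combines two of the sources above: it keeps phrases that contain Academic Word List entries. The sketch below shows one way such a filter could work. It is an illustration only, not the method used to build the published file: the function name and sample data are hypothetical, and it assumes exact, case-insensitive matching on whitespace-separated words, which ignores inflected forms.

```python
def phrases_with_awl(phrases, awl_words):
    """Keep phrases in which at least one word appears in the word list."""
    awl = {word.lower() for word in awl_words}
    return [
        phrase
        for phrase in phrases
        if any(token in awl for token in phrase.lower().split())
    ]


# Hypothetical samples standing in for the phrasebank and the word list.
sample_phrases = [
    "in the context of",
    "on the other hand",
    "the data analysis shows",
]
sample_awl = ["context", "data", "analysis"]

print(phrases_with_awl(sample_phrases, sample_awl))
# ['in the context of', 'the data analysis shows']
```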
English Frequent Phrasebank
No. | Phrasebank | Source | N-gram Length | Lines | Comments |
---|---|---|---|---|---|
1 | 📍google-10000-english | Google Books Corpus | 1 | 10,000 | The 10,000 most common English words from Google Books Corpus |
2 | 📍Wordlist 1200.txt | Internet | 1 | 2,000 | The 2,000 most common English words |
Quickstart
You can download the pre-made phrasebanks from the tables above. If you need a custom one, install the package and follow the steps below.
pip install openphrasebank
Get a Self-defined Phrasebank in 3 Steps
Below is an example based on n-gram frequency. More examples, e.g. extracting from a PDF, are available in the documentation.
1️⃣ Load and Tokenize the Data
import openphrasebank as opb
tokens_gen = opb.load_and_tokenize_data(
    dataset_name="orieg/elsevier-oa-cc-by",
    subject_areas=['PSYC', 'SOCI'],
    keys=['title', 'abstract', 'body_text'],
    save_cache=True,
    cache_file='temp_tokens.json',
)
2️⃣ Generate N-grams
n_values = [1,2,3,4,5,6,7,8]
# Keep the result: it is indexed by n-gram length as ngram_freqs[n] in step 3
ngram_freqs = opb.generate_multiple_ngrams(tokens_gen, n_values)
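For readers unfamiliar with n-gram counting, the sketch below shows the general idea with the standard library only. It is not the implementation behind `generate_multiple_ngrams`: the `count_ngrams` helper and the sample sentence are hypothetical, and the tokens are assumed to be a plain list of strings.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens, joined by spaces."""
    # zip over n shifted copies of the list yields each window of n tokens
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


tokens = "the results of the study and the results of the survey".split()
bigrams = count_ngrams(tokens, 2)

print(bigrams["the results"])  # 2
print(bigrams["the study"])    # 1
```

Counting each length separately, as `n_values` suggests, lets a different cut-off be applied to short and long phrases later on.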
3️⃣ Filter and save
# Define the top limits for each n-gram length
top_limits = {1: 2000, 2: 2000, 3: 1000, 4: 300, 5: 200, 6: 200, 7: 200, 8: 200}
# Filter the frequent n-grams and store the results in a dictionary
phrases = {}
freqs = {}
for n, limit in top_limits.items():
    phrases[n], freqs[n] = opb.filter_frequent_ngrams(ngram_freqs[n], limit, min_freq=20)
# Combine and sort the phrases from n-gram lengths 2 to 6
sorted_phrases = sorted(sum((phrases[n] for n in range(2, 7)), []))
# Write the sorted phrases to a text file, one phrase per line
with open('../elsevier_phrasebank_PSYC_SOCI.txt', 'w') as file:
    for line in sorted_phrases:
        file.write(line + '\n')
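The filtering step combines two cut-offs: a maximum number of phrases per n-gram length and a minimum frequency. The sketch below illustrates that logic on a `collections.Counter`. It is not the code behind `filter_frequent_ngrams`: the `top_ngrams` helper and the sample counts are hypothetical, and returning a list of phrases plus a list of counts is an assumption modelled on the two-value unpacking in the example above.

```python
from collections import Counter


def top_ngrams(freqs, limit, min_freq=1):
    """Return the `limit` most frequent n-grams seen at least `min_freq` times."""
    kept = [
        (gram, count)
        for gram, count in freqs.most_common(limit)
        if count >= min_freq
    ]
    phrases = [gram for gram, _ in kept]
    counts = [count for _, count in kept]
    return phrases, counts


# Hypothetical counts standing in for one entry of ngram_freqs.
sample = Counter({
    "in the context of": 40,
    "on the other hand": 25,
    "a rarely seen phrase": 3,
})

phrases_4, counts_4 = top_ngrams(sample, limit=3, min_freq=20)
print(phrases_4)  # ['in the context of', 'on the other hand']
print(counts_4)   # [40, 25]
```

Because `most_common` returns items in descending order of count, applying the frequency threshold after the limit gives the same result as applying it before.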
How to Contribute
You can contribute either phrasebanks or code. Check out our contributing guide.
Known Issues
Phrasebank | Issues |
---|---|
academic_phrasebank | Because tables in the PDF file are not handled properly, many sentences were not extracted correctly. (zhihao) |
elsevier_phrasebank |