Customize phrasebanks from various texts or corpora.

Open Phrasebank

Building your own phrasebank. ✨

This repository provides open phrasebanks: collections of frequently used phrases that can be fed into, for example, the auto-complete function of an IDE. (Note: this library does not provide an IDE or auto-complete functionality itself; it offers ready-to-use phrasebanks.)

The repository also includes tools for building a phrasebank from your own text or from an open corpus.
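As a rough illustration of the auto-complete use case mentioned above, the sketch below looks up completions by prefix in a sorted phrase list. It is not part of openphrasebank: the `complete` helper and the sample phrases are hypothetical, and it assumes the phrasebank has been loaded as a sorted list with one phrase per entry.

```python
import bisect

def complete(prefix, sorted_phrases, limit=5):
    """Return up to `limit` phrases starting with `prefix` (hypothetical helper)."""
    # In a sorted list, all phrases sharing a prefix sit next to each other,
    # starting at the insertion point of the prefix itself.
    start = bisect.bisect_left(sorted_phrases, prefix)
    matches = []
    for phrase in sorted_phrases[start:start + limit]:
        if not phrase.startswith(prefix):
            break
        matches.append(phrase)
    return matches

# Illustrative phrases only, not taken from any shipped phrasebank.
phrases = sorted([
    "as a result of",
    "as well as",
    "in addition to",
    "in the case of",
    "on the other hand",
])
print(complete("in ", phrases))
```

A binary search keeps each lookup cheap even for lists with several thousand lines, which is the size of the phrasebanks listed in this README.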

Why Use Phrase Bank

Boosting Typing Experience with Phrasebank 🚀

Academic Writing 🕵️‍♀

You can further customize a phrasebank to your needs, e.g. for a particular discipline, writing style (descriptive, analytical, persuasive, or critical), or section (abstract, body text), as long as you can find suitable source texts.

Open Phrasebanks

Academic Phrasebank

Elsevier OA CC-BY contains 40k articles from Elsevier's journals, spanning Arts, Business, STEM, and Social Sciences[^1].

No. | Phrasebank | Source | N of grams | Lines | Comments
1 | 📍academic_phrasebank | Book Academic Phrasebank 2014 | 2-5 | 2,190 | Extracted from PDF (Zhihao, 2024)
2 | 📍elsevier_phrasebank | Corpus Elsevier OA CC-BY 2020 | 2-6 | 3,792 | Extracted by n-gram (Zhihao, 2024)
3 | 📍bawe_1000.csv | Corpus British Academic Written English | 4-6 | 1,000 | Because of access restrictions, only the 1,000 most frequent entries are listed here. (Zhihao, 2024)
4 | 📍academic_word_list | Academic Word List Coxhead (2000) | 1 | 570 | The 570 words for academic English (excluding the 2,000 most frequent words)
5 | 📍elsevier_awl | 2,4 | 2-6 | 994 | Elsevier phrasebank entries that contain AWL words (Zhihao, 2024)
6 | 📍elsevier_ENVI_EART | 2 | 2-7 | 3,700 | Environment & Earth Science 3700 collection (Zhihao, 2024)
7 | 📍elsevier_PSYC_SOCI | 2 | 2-7 | 3,700 | Social Science & Psychology 3700 collection (Zhihao, 2024)
8 | 📍elsevier_MEDI | 2 | 2-7 | 3,700 | Medicine 3700 collection (Zhihao, 2024)

[^1]: Over 20 disciplines; see orieg/elsevier-oa-cc-by · Datasets at Hugging Face

English Frequent Phrasebank

No. | Phrasebank | Source | N-gram Length | Lines | Comments
1 | 📍google-10000-english | Google Books Corpus | 1 | 10,000 | The 10,000 most common English words from Google Books Corpus
2 | 📍Wordlist 1200.txt | Internet | 1 | 2,000 | The 2,000 most common English words

Quickstart

You can download a pre-made phrasebank from the tables above. If you need a custom one, install the package and follow the steps below.

pip install openphrasebank

Get a Self-defined Phrasebank in 3 Steps

Below is an example based on n-gram frequency. More examples, such as extracting phrases from a PDF, are available in the documentation.

1️⃣ Load and Tokenize the Data

import openphrasebank as opb

tokens_gen = opb.load_and_tokenize_data(dataset_name="orieg/elsevier-oa-cc-by",
                                        subject_areas=['PSYC', 'SOCI'],
                                        keys=['title', 'abstract', 'body_text'],
                                        save_cache=True,
                                        cache_file='temp_tokens.json')

2️⃣ Generate N-grams

# Count n-grams for every length in n_values; the result is reused in step 3
n_values = [1, 2, 3, 4, 5, 6, 7, 8]
ngram_freqs = opb.generate_multiple_ngrams(tokens_gen, n_values)
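To make the idea behind this step concrete, here is a minimal sketch of n-gram counting with the standard library. It is not the implementation of `opb.generate_multiple_ngrams`; the `count_ngrams` helper and the sample sentence are hypothetical, and representing each n-gram as a space-joined string is an assumption.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # Slide a window of width n over the token list and join each window
    # into a single phrase string before counting.
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

# Illustrative input only; a real run would use tokens from a corpus.
tokens = "the results of the study show the results of the survey".split()
bigram_counts = count_ngrams(tokens, 2)
print(bigram_counts.most_common(3))
```

Repeating the call for each value in `n_values` and storing the counters in a dictionary keyed by `n` would give a structure that can be indexed as `ngram_freqs[n]`, the way the filtering step uses it.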

3️⃣ Filter and save

# Define the top limits for each n-gram length
top_limits = {1: 2000, 2: 2000, 3: 1000, 4: 300, 5: 200, 6: 200, 7: 200, 8: 200}

# Filter the frequent n-grams and store the results in a dictionary
phrases = {}
freqs = {}
for n, limit in top_limits.items():
    phrases[n], freqs[n] = opb.filter_frequent_ngrams(ngram_freqs[n], limit, min_freq=20)

# Combine and sort the phrases from n-gram lengths 2 to 6
sorted_phrases = sorted(sum((phrases[n] for n in range(2, 7)), []))

# Write the sorted phrases to a text file, one phrase per line
with open('../elsevier_phrasebank_PSYC_SOCI.txt', 'w') as file:
    for line in sorted_phrases:
        file.write(line + '\n')
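The filtering step can be pictured as "keep the most frequent n-grams, but only those above a frequency floor". The sketch below shows that logic under stated assumptions: `top_ngrams` is a hypothetical helper, not the library's `filter_frequent_ngrams`, and its two return values simply mirror the `phrases, freqs` pair unpacked in the example above.

```python
from collections import Counter

def top_ngrams(counts, limit, min_freq=1):
    """Keep at most `limit` n-grams that occur at least `min_freq` times."""
    # most_common() yields (phrase, count) pairs in descending count order,
    # so truncating after the frequency filter keeps the top entries.
    kept = [(p, c) for p, c in counts.most_common() if c >= min_freq][:limit]
    phrases = [p for p, _ in kept]
    freqs = [c for _, c in kept]
    return phrases, freqs

# Illustrative counts only, not measured from any corpus.
counts = Counter({"in terms of": 40, "as well as": 35, "the fact that": 12})
phrases, freqs = top_ngrams(counts, limit=2, min_freq=20)
print(phrases, freqs)
```

A frequency floor such as `min_freq=20` drops phrases that reach the top-`limit` list only because few n-grams of that length exist, which matters most for the longer n-grams.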

How to Contribute

You can contribute either a phrasebank or code. Check out our contributing guide.

Known Issues

Phrasebank | Issues
academic_phrasebank | Tables in the PDF file were not handled properly, so many sentences were not extracted correctly. (zhihao)
elsevier_phrasebank |
