Python library for dealing with text
Vocabulary Extension
This project aspires to be a Chrome extension that parses the page on your screen and determines which vocabulary words you may be unfamiliar with.
Currently, it is a library for text handling and web scraping, providing useful helper functions.
Note: Read the Docs builds are failing, but GitHub Pages works fine. Error: "Some files were detected in an unsupported output path, '_build/html'. Ensure your project is configured to use the output path '$READTHEDOCS_OUTPUT/html' instead." _build/html is necessary for GitHub Pages to work.
Overview
This project is a library that can parse through a corpus of text and determine which vocabulary words you may be unfamiliar with. It also provides general text-handling functions that can be useful when working on projects involving text and scraping. It is naive in that it does not pre-determine your vocabulary level first. The ultimate goal is to turn this library into a usable web extension. Oftentimes when we look at a website, we are confronted with new terms. Instead of having to individually right-click on every single term to look up its definition, this extension will build a bank of vocabulary words from the article and display their meanings. If you click the extension's button, you will see the list of words and their definitions. You can also save words for future reference.
Quick Example
get_links(): This is a function that allows you to get all the links on a particular webpage.
(Input and output are shown as code-snippet images in the original README.)
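As a rough sketch of what such a call might look like, assuming the standard requests/BeautifulSoup pattern (the library's actual implementation and signatures may differ):

```python
# Illustrative sketch of get_soup() and get_links(); not the library's
# actual source, just the conventional requests + BeautifulSoup pattern.
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    """Fetch a page and return it parsed as a BeautifulSoup object."""
    response = requests.get(url)
    return BeautifulSoup(response.text, "html.parser")

def get_links(soup):
    """Return an array of all link targets (href values) on the page."""
    return [a["href"] for a in soup.find_all("a", href=True)]
```

For example, `get_links(get_soup("https://example.com"))` would return every `href` found on that page.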
Installation
- Clone from GitHub or pip install Vocabulary-Extension==0.2.1
- Install virtual environment: python -m venv env
- Activate virtual env: source env/bin/activate
- Install the dependencies: pip install .[develop]
- python setup.py build
- make lint
- make test
- Running main: python3 vocab_project/vocab.py
Functions Available
X marks functions that have unit tests written
- [x] get_soup(url) --> Returns scraped BeautifulSoup object
- [x] get_content(soup) --> Returns main content of the page
- [x] get_links(soup) --> Returns array of links on page
- [x] clean_corpus(corpus) --> Retains alphanumeric characters and apostrophes
- [ ] retrieve_sentences(corpus) --> Tokenizes sentences using NLTK
- [ ] retrieve_all_words(corpus) --> Tokenizes words (including stop words) using NLTK
- [ ] retrieve_all_non_stop_words(corpus) --> Tokenizes non-stop words
- [ ] word_count(corpus) --> Counts number of words (including stop words) in corpus
- [ ] individual_word_count(corpus) --> Counts number of times each individual word appears
- [ ] individual_word_count_non_stop_word(corpus) --> Counts number of times each non-stop word appears
- [ ] top_k_words(corpus, k) --> Finds top k words (excluding stop words)
- [x] frequency_distributions(corpus) --> Returns a plot with frequency distributions of non-stop words
- [x] get_definition(word) --> Uses WordNet to retrieve definition
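To illustrate how these helpers compose, here is a minimal sketch using plain-Python stand-ins. The function names mirror the list above, but the bodies are simplified assumptions (e.g. a tiny hard-coded stop-word set instead of NLTK's), not the library's actual implementation:

```python
# Simplified stand-ins for clean_corpus, word_count, and top_k_words.
# Real versions use NLTK tokenizers and stop-word lists; these are sketches.
import re
from collections import Counter

def clean_corpus(corpus):
    """Keep alphanumeric characters and apostrophes, as the README describes."""
    return re.sub(r"[^A-Za-z0-9' ]+", " ", corpus)

def word_count(corpus):
    """Count all words in the corpus, including stop words."""
    return len(clean_corpus(corpus).split())

def top_k_words(corpus, k, stop_words=("the", "a", "is")):
    """Find the k most common words, excluding stop words."""
    words = [w.lower() for w in clean_corpus(corpus).split()
             if w.lower() not in stop_words]
    return Counter(words).most_common(k)

text = "The quick brown fox jumps over the lazy dog. The fox is quick."
print(word_count(text))      # 13
print(top_k_words(text, 2))  # [('quick', 2), ('fox', 2)]
```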
Functions To Be Implemented
- find_advanced_words(corpus)
- summarize()
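find_advanced_words is not implemented yet; one plausible approach, sketched here purely as an assumption and not as the project's actual plan, is to flag words that fall outside a known common-words list:

```python
# Hypothetical sketch of find_advanced_words: flag words that are not in a
# set of common English words. One possible approach, not the project's design.
import re

COMMON_WORDS = {"the", "a", "is", "quick", "brown", "fox", "dog", "over", "lazy"}

def find_advanced_words(corpus, common_words=COMMON_WORDS):
    """Return the words in the corpus that are absent from the common-words set."""
    words = re.findall(r"[A-Za-z']+", corpus.lower())
    return sorted({w for w in words if w not in common_words})

print(find_advanced_words("The perspicacious fox is quick."))  # ['perspicacious']
```

A real implementation would swap the toy set for a frequency list derived from a large corpus (e.g. NLTK's corpora) and a per-user vocabulary level.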
Installation (manual)
- conda install beautifulsoup4
- mkdir env_holder
- cd env_holder
- Install virtual environment: python -m venv env
- Activate virtual env (from inside env_holder): source env/bin/activate
- pip install requests
- pip install nltk
- pip install matplotlib
- pip install scikit-learn (not pip install sklearn; the sklearn PyPI name is a deprecated alias)
- pip install pandas
- pip install lxml
- pip install pytest
- pip install black
- pip install flake8
- pip install urlopen
- pip install check-manifest
- pip install pip-login (not for library user- just me to update PyPI)
- pip install sphinx
- pip install sphinx_rtd_theme
- pip install recommonmark
- pip install sphinxcontrib-napoleon
Upload to PyPI
- python -m pip install --upgrade pip
- python -m pip install --upgrade build
- python -m build
- python -m pip install --upgrade twine
- Upload to testPyPI: python3 -m twine upload --repository testpypi dist/*
- Upload to PyPI: twine upload dist/*
Libraries
- Beautiful Soup: Python library to pull data out of HTML and XML files. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
- lxml: parser that works well even with broken HTML code
- requests
- nltk
- sklearn
- pandas
Tools Used
- Static Analysis- CodeQL
- Dependency management- Dependabot
- Unit testing- PyTest
- Package manager- pip
- CI/CD- GitHub Actions
- Fake data- Faker
- Linting- flake8
- Autoformatter- black
- Documentation- GitHub pages, Sphinx, Carbon (for picturing Code snippet)
Make Commands
- make: list available commands
- make develop: install and build this library and its dependencies using pip
- make build: build the library using setuptools
- make lint: perform static analysis of this library with flake8 and black
- make format: autoformat this library using black
- make annotate: run type checking using mypy
- make test: run automated tests with pytest
- make coverage: run automated tests with pytest and collect coverage information
- make dist: package library for distribution
Testing Commands
Run either:
- make test
- python -m unittest vocab_project/tests/test_unit.py
- python -m unittest vocab_project/tests/test_integration.py
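A minimal example of the unittest style these commands run. The test case below is illustrative only (it uses a stand-in `clean_corpus`, not the actual contents of vocab_project/tests):

```python
# Illustrative unit test in the unittest style; run with
#   python -m unittest <module>
# The clean_corpus here is a simplified stand-in for the library's version.
import re
import unittest

def clean_corpus(corpus):
    """Stand-in: keep alphanumeric characters, apostrophes, and spaces."""
    return re.sub(r"[^A-Za-z0-9' ]+", "", corpus)

class TestCleanCorpus(unittest.TestCase):
    def test_removes_punctuation(self):
        self.assertEqual(clean_corpus("it's done!"), "it's done")
```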
Useful Links
- https://www.youtube.com/watch?v=6tNS--WetLI&ab_channel=CoreySchafer
- https://realpython.com/python-testing/#writing-integration-tests
- https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_testing_with_scrapers.htm
Documentation
- https://sphinx-rtd-tutorial.readthedocs.io/en/latest/build-the-docs.html
- https://gist.github.com/GLMeece/222624fc495caf6f3c010a8e26577d31
- https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html
- https://stackoverflow.com/questions/10324393/sphinx-build-fail-autodoc-cant-import-find-module
- https://stackoverflow.com/questions/13516404/sphinx-error-unknown-directive-type-automodule-or-autoclass
RST Cheatsheets
- https://github.com/ralsina/rst-cheatsheet/blob/master/rst-cheatsheet.rst
- https://docs.typo3.org/m/typo3/docs-how-to-document/main/en-us/WritingReST/Reference/Code/Codeblocks.html
- https://sublime-and-sphinx-guide.readthedocs.io/en/latest/code_blocks.html
Running Documentation Locally
To (re)generate the .rst files for docstrings:
- sphinx-apidoc -o ./source ../vocab_project
- cd docs
- make clean
- make html
- open build/html/index.html
Hashes for Vocabulary-Extension-0.2.1.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | c4c54818b485d8b6eaf8a0700f627ab32a740ed25e5ef1decaafc92746ef1022 |
| MD5 | 412eb770e47ae6571e82e44709ce3b57 |
| BLAKE2b-256 | 915e1375ade6e927d90477132fece6e4c04343a71062ef8b7b44c8d0390c1e80 |

Hashes for Vocabulary_Extension-0.2.1-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 14c32157ec17ae108f066306eb5779121d42579def06df283a72515d623b0aca |
| MD5 | 3e90c3db611a5da6594216172e25d56d |
| BLAKE2b-256 | f141bfaefd8d9377c315cb8fa32776bd24506bc2b8080f3b331876709d99c379 |