NLP tools to extract, normalize and filter sentences from text/HTML

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Phrasal

Foreword

What is it?

Phrasal is a library of tools to help gather meaningful, proper sentences from websites.

Well, at least when used together. Each tool also has value on its own. For example, the Normalizer (my favorite!) is very useful for NLP when you need to clean a crappy text corpus. The MocySplitter is a nice alternative to Moses when you need to cleverly split a stream of text into sentences, one per line. Etc.

Why was it developed?

I have lately been working on a project called SwissText, which gathers Swiss German sentences by scraping the Internet (no kidding, see the LREC 2020 publication on arXiv). To do so, I had to build upon existing tools and develop some of my own. While they were initially written for Swiss German, I figured they might be useful in other contexts, hence this repo, which is a stripped-down version of some of the SwissText modules.

How does it work?

This repo contains implementations of four types of tools, which together form a pipeline:

converter: extract (main) text from raw HTML;
normalizer: normalize the raw text, including the encoding, quotes, spaces, etc.;
splitter: split the text into chunks (potential sentences);
filterer: filter chunks to keep only "proper" sentences.

For each step, I propose one or more implementations.
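To make the four stages concrete, here is a minimal, standard-library-only sketch of the same pipeline shape. The function names (convert, normalize, split, keep, run_pipeline) are hypothetical stand-ins invented for this illustration. They are not phrasal's API, and each stage is far cruder than the corresponding tool.

```python
import re
from typing import List


def convert(html: str) -> str:
    """Converter stage (crude stand-in): replace every tag with a line break."""
    return re.sub(r"<[^>]+>", "\n", html)


def normalize(text: str) -> str:
    """Normalizer stage (crude stand-in): collapse spaces, drop empty lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def split(text: str) -> List[str]:
    """Splitter stage (crude stand-in): cut each line after ., ! or ?."""
    chunks: List[str] = []
    for line in text.splitlines():
        chunks.extend(re.split(r"(?<=[.!?])\s+", line))
    return chunks


def keep(chunk: str) -> bool:
    """Filterer stage (crude stand-in): require at least five words."""
    return len(chunk.split()) >= 5


def run_pipeline(html: str) -> List[str]:
    """Chain the four stages: convert, normalize, split, filter."""
    return [chunk for chunk in split(normalize(convert(html))) if keep(chunk)]


html = "<html><body><h1>Menu</h1><p>This tool works very well for me. Ok!</p></body></html>"
print(run_pipeline(html))  # ['This tool works very well for me.']
```

Each stage only consumes the output of the previous one, which is why any implementation of a stage can be swapped for another.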

Tools available

HtmlConverters

phrasal.BsConverter
A converter built upon BeautifulSoup that extracts the text found in the HTML. Text from code blocks, scripts or styles is ignored. It deals cleverly with encodings and always delivers text in UTF-8.
phrasal.JustextConverter
A converter based on justext that tries to spot and remove boilerplate content. By default, it only keeps "good" paragraphs, that is, text long enough to be a full sentence and with a low link density.
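The idea of extracting text while skipping scripts, styles and code blocks can be sketched with Python's built-in html.parser. This is only an illustration of the technique: BsConverter relies on BeautifulSoup, and the TextExtractor class, the html_to_text function and the exact set of skipped tags below are assumptions made for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything nested in a skipped tag."""

    # Assumed set of tags to ignore; the real converter may use a different one.
    SKIPPED = {"script", "style", "code", "pre"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<p>Hello world.</p><script>var x = 1;</script><pre>code here</pre>"
    "</body></html>"
)
print(html_to_text(sample))  # Hello world.
```

This sketch does no boilerplate detection and no encoding handling, which are the parts the two converters above take care of.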

Normalizers

phrasal.Normalizer, or simply phrasal.normalize_text
Normalize some text (using a series of homemade regexes), including: normalize spaces, replace combining diacritics with the accented letter codepoints and strip leftovers, normalize dashes and quotation marks, replace non-breaking spaces with regular ones, etc.
It can also try to fix encoding errors (see ftfy) and strip most Unicode emoji symbols.
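A few of these normalization steps can be sketched with unicodedata and re. The normalize_sketch function below is hypothetical: it does not reproduce the library's regexes, and the character ranges it covers are assumptions chosen for the example.

```python
import re
import unicodedata


def normalize_sketch(text: str) -> str:
    """Apply a handful of the normalization steps listed above."""
    # Compose "letter + combining diacritic" into a single code point (NFC).
    text = unicodedata.normalize("NFC", text)
    # Strip leftover combining marks that have no precomposed form.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace non-breaking spaces (regular and narrow) with plain spaces.
    text = text.replace("\u00a0", " ").replace("\u202f", " ")
    # Normalize curly double and single quotes to their ASCII forms.
    text = re.sub("[\u201c\u201d\u201e]", '"', text)
    text = re.sub("[\u2018\u2019\u201a]", "'", text)
    # Normalize the various Unicode dashes (U+2010 to U+2015) to a hyphen.
    text = re.sub("[\u2010-\u2015]", "-", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = "Cafe\u0301\u00a0 \u201cbon\u201d \u2013  ok"
print(normalize_sketch(raw))  # Café "bon" - ok
```

Encoding repair and emoji stripping are left out here; the first is what ftfy is for.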

Splitters

phrasal.MosesSplitter
Moses' splitter split-sentences.perl, completely rewritten in Python. It thus perfectly mimics the original behavior while being 5x faster than calling perl from Python (the approach taken by MosesTokenizer, for example).
phrasal.MocySplitter
An improvement upon MosesSplitter, which deals more efficiently with lowercase (people are lazy on the Web), tries to preserve links, can optionally split on : or ;, etc.
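For readers unfamiliar with this family of splitters, the core idea is roughly: break after sentence-ending punctuation when the next word starts with an uppercase letter, unless the word before the period is a known non-breaking prefix. The sketch below is a heavily simplified illustration of that idea, not a port of either splitter; split_sentences and the tiny NON_BREAKING set are hypothetical.

```python
import re

# Hypothetical, tiny list of non-breaking prefixes for the example only.
NON_BREAKING = {"Dr", "Mr", "Mrs", "etc"}


def split_sentences(text: str) -> list:
    """Split text after ., ! or ? when an uppercase letter follows."""
    sentences, start = [], 0
    for match in re.finditer(r"([.!?]+)\s+", text):
        next_char = text[match.end():match.end() + 1]
        if not next_char.isupper():
            continue  # lowercase continuation: not treated as a boundary
        words = text[start:match.start()].split()
        last_word = words[-1] if words else ""
        if match.group(1) == "." and last_word in NON_BREAKING:
            continue  # "Dr. Smith" must stay together
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Dr. Smith arrived late. He was tired! Was it worth it? yes."
for sentence in split_sentences(text):
    print(sentence)
# Dr. Smith arrived late.
# He was tired!
# Was it worth it? yes.
```

Note how the last line is not split because "yes" is lowercase. Lowercase sentence starts are precisely the kind of Web text that MocySplitter is said to handle better.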

Filterers

phrasal.PatternSentenceFilter
A filterer based on a list of simple rules a proper sentence should respect, such as "at least five words", "no S P E L L E D words", etc.
What is awesome? The rules are expressed in a (homemade) YAML-based syntax and are highly customizable. If you don't like the behavior, have a look at pattern_sentence_filter.yaml and try writing your own set of rules!
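The two rules quoted above can be sketched in plain Python to show what rule-based filtering looks like. This does not use or imitate the YAML rule syntax of pattern_sentence_filter.yaml; the RULES list and the failed_rules and is_proper helpers are hypothetical names, and the regex is only one possible reading of "spelled words".

```python
import re

# Each rule is a (description, predicate) pair; a sentence must satisfy all.
RULES = [
    ("at least five words", lambda s: len(s.split()) >= 5),
    # Three or more single characters separated by single spaces.
    ("no S P E L L E D words",
     lambda s: re.search(r"(?<!\w)(?:\w ){2,}\w(?!\w)", s) is None),
]


def failed_rules(sentence: str) -> list:
    """Return the descriptions of the rules the sentence violates."""
    return [name for name, is_ok in RULES if not is_ok(sentence)]


def is_proper(sentence: str) -> bool:
    """A sentence is kept only when it violates no rule."""
    return not failed_rules(sentence)


samples = [
    "not-a-sentence",
    "YEAH !!!",
    "This is a S P E L L E D word indeed.",
    "Cet outil fonctionne très bien, je l’utilise tous les jours.",
]
print([s for s in samples if is_proper(s)])
# ['Cet outil fonctionne très bien, je l’utilise tous les jours.']
```

Returning the names of the failed rules, rather than a bare boolean, makes it easier to see why a sentence was rejected when tuning a rule set.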

link_utils

The phrasal.link_utils module is a simple utility to process href links found on a page. It resolves relative links (given a base URL), removes duplicates, strips anchors and excludes non-HTTP/HTTPS links.

To get the list of links from a URL (i.e. href found on the page main content), use extract_links:

import phrasal

all_links = phrasal.extract_links('https://github.com/derlin/phrasal')
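The four clean-up operations described above (resolve, strip anchors, filter by scheme, deduplicate) can be sketched with urllib.parse alone. The clean_links function is a hypothetical illustration, not the implementation behind phrasal.link_utils, and it works on a list of href strings rather than fetching a page.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def clean_links(base_url: str, hrefs: list) -> list:
    """Resolve, strip anchors, keep HTTP(S) only, and deduplicate in order."""
    seen, result = set(), []
    for href in hrefs:
        absolute = urljoin(base_url, href)         # resolve relative links
        absolute, _fragment = urldefrag(absolute)  # strip the #anchor part
        if urlparse(absolute).scheme not in ("http", "https"):
            continue                               # e.g. mailto:, javascript:
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = clean_links(
    "https://example.com/docs/",
    [
        "intro.html",
        "/about",
        "intro.html#top",
        "mailto:me@example.com",
        "https://example.com/about",
    ],
)
print(links)
# ['https://example.com/docs/intro.html', 'https://example.com/about']
```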

How to use

Install the library using:

# regular install, one of:
python setup.py install 
pip install .

# for development, one of:
python setup.py develop
pip install -e .
pip install -e .[showcase] # for streamlit

As a library

import phrasal

Done.

From the command line

Each tool has a command-line interface with its own arguments. Discover them by typing:

python -m phrasal --help

Call one of the tools from the command line. Usage: 
   classname [other arguments specific to classname]|[-h]

Allowed classname arguments:
 - BsConverter
 - JustextConverter
 - PatternSentenceFilter
 - MocySplitter
 - MosesSplitter
 - Normalizer

Here are some examples:

python -m phrasal JustextConverter -u https://icosys.ch/swisscrawl
=== from URL https://icosys.ch/swisscrawl
As part of the SwissTranslation project, SwissCrawl is a corpus of 500,000+ Swiss German (GSW)  [...]
[...]

python -m phrasal PatternSentenceFilter -i <(printf 'not-a-sentence\nYEAH !!!\nCet outil fonctionne très bien, je l’utilise tous les jours.\n')
Cet outil fonctionne très bien, je l’utilise tous les jours.

python -m phrasal Normalizer -i raw_text.txt -o clean_text.txt

I just need one tool...

No problem, each tool is more or less independent. You may want to simplify the code a bit (e.g. remove the interface inheritance, transform classes into static scripts, I don't know), but I hope the source code is self-explanatory.

Running tests

Tests use tox and pytest. The easiest way to run them is:

pip install tox tox-venv
tox

Running the showcase

A showcase using streamlit is included. It allows you to test the full pipeline straight from your browser and to play with the different tools and options in the Live Customizer. Once you have found what works for you, you can simply copy-paste the generated code snippet.

Run the showcase locally by doing:

pip install streamlit
streamlit run src/showcase/lit.py

License

This work is licensed under Apache 2.0, so you can basically do anything with it.

However, I would really appreciate it if you credited me somehow: cite my name, send me an email to say hi (I get lonely sometimes, and it may be nice to chat), leave a star on GitHub, or do anything else you think may give me the strength to keep doing open source :blush:.

Related resources

get-html to get raw or rendered HTML (used in this repo)
SwissText
SwissTranslation project page
:octopus::octopus::octopus::octopus::octopus::octopus::octopus::octopus: (I just love octopuses)
Personal website

TODO

add some use cases, such as finding links, cleaning a text file, etc.
add language support information


This version

0.0.1

Mar 7, 2020


Source Distribution

phrasal-0.0.1.tar.gz (27.9 kB view details)

Uploaded Mar 7, 2020 Source

File details

Details for the file phrasal-0.0.1.tar.gz.

File metadata

Download URL: phrasal-0.0.1.tar.gz
Upload date: Mar 7, 2020
Size: 27.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.4

File hashes

Hashes for phrasal-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`9220c66fff1fc1235f44b4bd0b0d29461ac98c1b4700ed571cb29ba358b481fe`
MD5	`0ff2475f011f00bb8dc26f73a4aaf57c`
BLAKE2b-256	`08a33aa4821af24e5bc475ad9c6d783ca31b43be037f8ccd5fb1ab00589c9154`

