
A quick way to handle common NLP preprocessing tasks. Specialised for lightweight, English-language jobs that you just want done

Project description

Bitte

Please, somebody get me a beer instead of another four hours of NLP preprocessing!

This module speeds you up so you can get to the exciting tasks.

It is a quirky collection of the makeshift tasks most NLP projects run into: tasks too recently relevant, too small, or too monolingual to have made it into the common NLP modules like spaCy, NLTK, Hugging Face Transformers, Spark, etc.

You hit walls processing data with NLP

This module helps you overcome annoying wastes of time. Now you can glue the jagged OCR output of some strange .epub file (which once had a life as a .pdf exported from .pptx) into your pristine and beautiful BERT acronym recognition model, which ONLY WORKS WITH FULL SENTENCES AND NORMAL GRAMMAR[*1]

Quick starts

Repunctuate

repunctuated = bitte.repunctuate(list_of_strings)
  • Combines the functionality of modules like rpunct, NNSplit (sentence splitting), and transformer models performing the CoLA task. It runs quickly, and you skip both the hassle of getting rpunct working on Windows and the work of quantizing a DistilBERT model for fast execution: it'll just work.

This tool has zero reliance on external APIs and does not call a large language model under the hood. That's why it's fast and cheap.
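To make the expected input and output shapes concrete, here is a toy stand-in with a hypothetical name, naive_repunctuate. It only capitalises fragments and appends full stops; bitte's actual model restores punctuation and sentence boundaries properly, so treat this purely as an illustration of the list-in, list-out contract:

```python
from typing import List

def naive_repunctuate(fragments: List[str]) -> List[str]:
    """Toy illustration of the repunctuate contract: takes a list of
    unpunctuated strings, returns a list of cleaned-up sentences.
    (bitte uses a transformer model, not this heuristic.)"""
    restored = []
    for fragment in fragments:
        text = fragment.strip()
        if not text:
            continue  # skip empty fragments
        text = text[0].upper() + text[1:]
        if text[-1] not in ".!?":
            text += "."
        restored.append(text)
    return restored

print(naive_repunctuate(["the meeting is at noon", "bring the slides"]))
# → ['The meeting is at noon.', 'Bring the slides.']
```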

Semantic Chunk Text

chunks: List[List[str]] = bitte.chunk_semantically(input_string)

Semantic search is amazing. There's so much of it. But the second or third time you run it, you'll get some weird, disappointing half-sentence back as a result. Oh no, you realise: the real world isn't like Wikipedia. We aren't in Kansas. The OCR parser is in Kan sa5.

You're going to need the big guns. Semantic Chunk Text can take any string of text and split it into reasonably semantic chunks, ideal for retrieval in semantic search and for embedding tasks.

This API functions similarly to the Autochapter API of Audacity, but it works on text and does not use transcript timing information.

This API isn't super fast, since it often runs transformer models multiple times, but it is cheap and good quality.
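As a sketch of the general idea (not bitte's actual algorithm), semantic chunking can be framed as grouping consecutive sentences until similarity to the previous sentence drops below a threshold. The hypothetical chunk_by_similarity below uses bag-of-words cosine similarity where bitte would use transformer embeddings, so its boundaries are much cruder:

```python
import math
from collections import Counter
from typing import List

def _cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk_by_similarity(sentences: List[str], threshold: float = 0.2) -> List[List[str]]:
    """Group consecutive sentences; start a new chunk whenever
    similarity to the previous sentence falls below the threshold.
    Illustrative only: bitte's version is embedding-based."""
    chunks: List[List[str]] = []
    prev: Counter = Counter()
    for sentence in sentences:
        words = Counter(sentence.lower().split())
        if chunks and _cosine(prev, words) >= threshold:
            chunks[-1].append(sentence)  # same topic, extend chunk
        else:
            chunks.append([sentence])    # topic shift, new chunk
        prev = words
    return chunks

sentences = [
    "The cat sat on the mat.",
    "The cat slept on the mat.",
    "Quarterly revenue rose sharply.",
]
print(chunk_by_similarity(sentences))
# The first two sentences group together; the third starts a new chunk.
```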

Wholesome

sentence_classifications = bitte.are_full_sentences(sentences)
# Filter for full sentences only:
full_sentences = [sentence for ind, sentence in enumerate(sentences) if sentence_classifications[ind]]

Have you ever felt your sentence was incompl

Now you can check whether an English sentence is complete. This is useful for grammar merging decisions, detecting conversational interruptions, and deciding whether to display questionable-quality content to users.

This API isn't super fast either, as it also needs a transformer model, but it is cheap and good quality.
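The filtering pattern from the quick start can also be written with zip instead of manual indexing. The classifications list below is a hard-coded stand-in for the output of bitte.are_full_sentences, so this snippet runs without the package installed:

```python
from typing import List

sentences: List[str] = [
    "The model converged.",
    "because of the",
    "Training finished early.",
]
# Stand-in for: classifications = bitte.are_full_sentences(sentences)
classifications: List[bool] = [True, False, True]

# zip pairs each sentence with its boolean, so no enumerate/indexing needed:
full_sentences = [s for s, is_full in zip(sentences, classifications) if is_full]
print(full_sentences)
# → ['The model converged.', 'Training finished early.']
```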

Why now?

It's a fact that a lot of NLP right now works better for English: large language models; medium-sized ones like BERT; OCR trained on English datasets... you get the point. If you speak English, you'll be tempted to take advantage of this somewhere in your pipeline. These tools are for you.

