to_paragraphs

Note: the code is currently slow and messy. It is a work in progress.

Extracts paragraphs from a string, using the semantic similarity between sentences to determine paragraph boundaries. This tends to work better than the naive approach of splitting on newlines.
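To make the idea concrete, here is a minimal sketch of similarity-based paragraph splitting. It is not the package's actual implementation: `toy_embed`, `cosine`, `split_by_similarity`, the regex sentence splitter, and the `threshold` value are all hypothetical names and choices made for illustration, and the bag-of-words "embedding" is only a stand-in for a real model.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    # Hypothetical stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_by_similarity(text, embed=toy_embed, threshold=0.1):
    # Naive sentence splitter (an assumption): break after ., ! or ? plus whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    paragraphs = [[sentences[0]]]
    # Start a new paragraph whenever adjacent sentences are dissimilar.
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            paragraphs.append([sentence])
        else:
            paragraphs[-1].append(sentence)
    return [" ".join(p) for p in paragraphs]


sample = "Pizza dough needs flour. Pizza dough needs yeast. Whales swim deep. Whales swim far."
for paragraph in split_by_similarity(sample):
    print(paragraph)
```

With this toy input, the two pizza sentences end up in one paragraph and the two whale sentences in another, because the only adjacent pair with no shared words is the second and third sentence. A real embedding model would capture meaning rather than word overlap, and the threshold would need tuning.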

Installation

pip install to_paragraphs

Example

from to_paragraphs import to_paragraphs

text = """
The biosphere includes everything living on Earth it is also known as ecosphere. Currently the biosphere has a biomass (or amount of living things) at around 1900 gigatonnes of carbon. It is not certain exactly how thick the biosphere is, though scientists predict that it is around 12,500 meters. The biosphere extends to the upper areas of the atmosphere, including birds and insects. 
Pizza is an Italian food that was created in Italy (The Naples area). It is made with different toppings. Some of the most common toppings are cheese, sausages, pepperoni, vegetables, tomatoes, spices and herbs and basil. These toppings are added over a piece of bread covered with sauce. The sauce is most often tomato-based, but butter-based sauces are used, too. The piece of bread is usually called a "pizza crust". Almost any kind of topping can be put over a pizza. The toppings used are different in different parts of the world. Pizza comes from Italy from Neapolitan cuisine. However, it has become popular in many parts of the world.
"""


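def embed_fn(content):
    # Placeholder: replace the body with code that returns an embedding for `content`
    # (see "Example embedding function" below for one option).
    raise NotImplementedError("supply an embedding function")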


paragraphs = to_paragraphs(text, embed_fn=embed_fn)

for paragraph in paragraphs:
    print(paragraph)  # prints the paragraph about the biosphere, then the paragraph about pizza

Example embedding function

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

CACHE_PATH = '/tmp/transformers'
TOKENIZER = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2', cache_dir=CACHE_PATH)
MODEL = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2', cache_dir=CACHE_PATH)


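def mean_pooling(model_output, attention_mask):
    # Average the token embeddings of each sentence, ignoring padding tokens.
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over the token dimension, then divide by the number of real tokens
    # (clamped to avoid division by zero).
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def embed(sentences):
    # Tokenize the batch, padding to the longest sentence and truncating overlong ones.
    encoded_input = TOKENIZER(sentences, padding=True, truncation=True, return_tensors='pt')

    # Inference only, so skip gradient tracking.
    with torch.no_grad():
        model_output = MODEL(**encoded_input)

    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # L2-normalize so that a dot product between embeddings equals cosine similarity.
    return F.normalize(sentence_embeddings, p=2, dim=1)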
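For readers unfamiliar with masked mean pooling, the following plain-Python sketch works through the same arithmetic as `mean_pooling` and `F.normalize` on a single tiny sentence. The names `masked_mean_pool` and `l2_normalize` and the numbers are made up for illustration; the real functions operate on batched tensors, and `F.normalize` guards against a zero norm with a small epsilon rather than the explicit check used here.

```python
import math


def masked_mean_pool(token_embeddings, attention_mask):
    # token_embeddings: one vector per token; attention_mask: 1 = real token, 0 = padding.
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            summed[i] += vector[i] * keep  # padding tokens contribute nothing
    count = max(sum(attention_mask), 1e-9)  # same clamp as in mean_pooling
    return [value / count for value in summed]


def l2_normalize(vector):
    # Scale the vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled)  # [3.0, 4.0]
print(unit)    # [0.6, 0.8]
```

The padding token's large values do not affect the result because its mask entry is 0, which is why the attention mask is passed into the pooling step.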
