NLP tool for scraping text from a corpus of PDF files, embedding the sentences in the text and finding semantically similar sentences to a given search query

The code in this repository performs 3 main tasks.

  • Scraping the text from a corpus of PDF files. The text is then cleaned, split into sentences, and saved into a pd.DataFrame, .csv or .parquet file containing 3 columns. One column contains the text of all the PDFs in the corpus (one sentence per row), the second column contains the title of the PDF where each sentence is taken from, and the third column contains the number of the page where each sentence is located within that PDF. This enables easy lookup.
  • Embedding all the scraped sentences in the corpus of PDFs using three different NLP models: Word2Vec (with the option to include Tf-Idf weights), ELMo and BERT. For each model, sentence-level embeddings are generated.
  • Corpus querying. This is in the form of a search tool, where the user can input a search query (one to a few words), and the tool will output the most similar sentences in the PDF corpus to the user query. This is done by comparing the embedding of the user query against all the embeddings of each sentence in the scraped corpus of PDFs. This effectively acts as a search engine. It is important that the model used to embed the user's search query matches the NLP model used to embed the PDF corpus. The default similarity metric is cosine similarity, although this can be changed by the user.
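
The querying step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's own code: the helper names `cosine_similarity` and `rank_sentences` and the toy 3-dimensional embeddings are invented for the example, and real embeddings would come from one of the models listed above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def rank_sentences(query_embedding, sentence_embeddings, sentences, top_k=3):
    """Return the top_k (sentence, similarity) pairs, most similar first."""
    scored = [
        (sentence, cosine_similarity(query_embedding, embedding))
        for sentence, embedding in zip(sentences, sentence_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, invented for illustration only.
sentences = ["phones need charging", "cats sleep a lot", "mobile batteries drain"]
embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query = [1.0, 0.0, 0.0]

ranked = rank_sentences(query, embeddings, sentences, top_k=2)
for sentence, score in ranked:
    print(f"{score:.3f}  {sentence}")
```

The ranking is only meaningful when the query and the corpus were embedded with the same model, because vectors from different models do not share a coordinate space.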

Project Organization

├── LICENSE            <- The full Licence text. This project is released under the MIT Licence.
├── Makefile           <- Makefile with commands like `make data` or `make train`
├──          <- The top-level README for developers using this project.
├── .envrc             <- The file containing the set up for environment variables (required if using the runner 
│                         scripts). `$PWD` should correspond to the directory where you clone this repository.
├── .gitignore         <- The files (including data) which are not uploaded to GitHub. Edit as required.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling. This is where your cleaned datasets will be saved.
│   └── raw            <- The original, immutable data dump.
│       └── pdfs       <- Where your PDF files are stored.
├── models             <- Trained and serialized models. This is where your NLP models will be saved. No models have 
│                         been uploaded to GitHub.
├── requirements.txt   <- The requirements file for reproducing the analysis environment.
├──           <- Makes project pip installable (pip install -e .) so pdf2emb_nlp can be imported
├──        <- Tells which package data to include and exclude.
├── config             <- This folder stores configuration files (for example suggested filenames for saving 
│   │                     specific objects) that are read in by some of the runner scripts. Edit as required.
│   ├── filenames.json            
│   └── words_to_replace.json
├── scripts            <- Executable scripts are saved here. They should be run in the order listed below.
│   ├──   
│   ├──
│   ├──
│   └── tmp            <- The folder where the loggers are saved (for example, debug.log, info.log, warning.log)
│                         Logs have not been uploaded to GitHub.
├── pdf2emb_nlp        <- Source code for use in this project. See description below for how to use the files.
│   ├──    <- Makes pdf2emb_nlp a Python module
│   ├──
│   ├──
│   ├──
│   ├── logging.yaml
│   ├──
│   └──
├── tests              <- Unit tests for all functions and methods defined in all modules within the pdf2emb_nlp folder, to                
│   │                     be run using pytest. It also includes an end-to-end test. These should not be modified by
│   │                     the user.  
│   ├──
│   ├──
│   ├──
│   ├──
│   ├──
│   ├──
│   └── fixtures       <- This folder contains all the pytest fixtures required to run the tests. These should not
│       │                 be modified by the user.                
│       ├── dummy_embeddings.npy
│       ├── dummy_sentences.csv
│       ├── dummy_sentences.parquet
│       ├── dummy_sentences.txt
│       ├── expected_bert_embeddings.npy
│       ├── expected_elmo_embeddings.npy
│       ├── expected_tfidf_scores.json
│       ├── expected_w2v_embeddings_tfidf_false.npy
│       ├── expected_w2v_embeddings_tfidf_true.npy
│       ├── full_df_with_embeddings.parquet.gzip
│       ├── test_pdf_1.pdf
│       ├── test_pdf_2.pdf
│       ├── tfidf_vectorizer.pickle
│       ├── word2vec.pickle
│       ├── word2vec_tfidf.pickle
│       └── words_to_replace.json
└── tox.ini            <- tox file with settings for running tox; see

Clone this repository from GitHub, or install this project by running

$ pip install pdf2emb_nlp

in the terminal (or pip3, as appropriate). If you have cloned it from GitHub, you can run a test to ensure your environment is properly set up. This project has not been tested on versions of Python older than 3.6, and some versions of the numpy library older than 1.17 are also known to cause issues. Please run the following line in your terminal:

$ PYTHONHASHSEED=123 python3 -m pytest

There should be 35 tests. If they all pass, you're good to start using this package. If some of the tests fail, please check your environment. This project has only been tested with the environment as described in the requirements.txt file. Note that the environment variable PYTHONHASHSEED must be set to "123" while running the tests, to ensure deterministic reproducibility of the Word2Vec models. Two tests will fail if this is not set up correctly.
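
The reason a fixed PYTHONHASHSEED matters is that Python randomizes string hashing per process unless this variable is set, and Word2Vec training presumably depends on those hashes somewhere. The sketch below only demonstrates the general Python behaviour, not anything specific to this package; the helper name `string_hash_in_subprocess` is invented for the example.

```python
import os
import subprocess
import sys


def string_hash_in_subprocess(seed):
    """Start a fresh interpreter with PYTHONHASHSEED=seed and return hash('corpus')."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('corpus'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two separate processes with the same seed agree on the hash of a string.
first = string_hash_in_subprocess(123)
second = string_hash_in_subprocess(123)
print(first == second)
```

The variable has to be set before the interpreter starts, which is why it is given on the command line rather than assigned inside the test code.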

Each module has been fully documented.
Before you start, please configure your environment variables according to your own directory path. Please refer to the .envrc file, where $PWD corresponds to the directory path where this repository has been cloned.

To scrape the text from a corpus of PDF files, save your PDFs in the folder ~/data/raw/pdfs. As an alternative to local storage, this package also supports PDFs stored in AWS S3 buckets; other cloud storage solutions are not natively supported. The scraping script in the scripts folder scrapes the PDFs, cleans the text, splits it into sentences, and saves the result to a .csv file. The script imports the two modules scraper and arrange_text.

import os
import yaml
import logging.config
from pdf2emb_nlp.scraper import DocumentScraper
from pdf2emb_nlp.arrange_text import CorpusGenerator

if __name__ == "__main__":
    DATA_DIR = os.getenv('DATA_DIR')
    CONFIG_DIR = os.getenv('CONFIG_DIR')

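    # Assumption: the path to logging.yaml is provided by a LOGGING_CONFIG environment variable (see .envrc).
    LOGGING_CONFIG = os.getenv('LOGGING_CONFIG')
    with open(LOGGING_CONFIG, 'r') as f:
        config = yaml.safe_load(f)
    logging.config.dictConfig(config)  # apply the logging configuration that was just loaded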

    pdfs_folder = os.path.join(DATA_DIR, 'raw', 'pdfs')
    json_path = os.path.join(CONFIG_DIR, 'words_to_replace.json')
    scraper = DocumentScraper(pdfs_folder, json_path)
    df_by_page = scraper.document_corpus_to_pandas_df()
    generator = CorpusGenerator(df_by_page)
    df_by_sentence = generator.df_by_page_to_df_by_sentence()

    df_by_page.to_csv(os.path.join(DATA_DIR, 'processed', 'corpus_by_page.csv'))  # optional, for reference
    df_by_sentence.to_csv(os.path.join(DATA_DIR, 'processed', 'corpus_by_sentence.csv'), index=False)

The file words_to_replace.json in the config folder is used for ad-hoc text cleaning. If your PDFs are stored in an AWS S3 bucket, set from_s3_bucket=True when instantiating the scraper:

scraper = DocumentScraper(pdfs_folder, json_path, from_s3_bucket=True)

If you choose this option, the pdfs_folder can either start with "s3://" or omit this prefix. Please ensure you already have the correct credentials for accessing the S3 bucket you are pointing to, and that the path is correct.
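
Accepting a path with or without the "s3://" prefix amounts to normalising it before use. The helper below is a hypothetical sketch of that idea, not the package's implementation; the name `split_s3_path` and the bucket name are assumptions made for illustration.

```python
def split_s3_path(path):
    """Split 's3://bucket/prefix' or 'bucket/prefix' into (bucket, key_prefix)."""
    scheme = "s3://"
    if path.startswith(scheme):
        path = path[len(scheme):]  # drop the optional scheme
    bucket, _, key_prefix = path.strip("/").partition("/")
    return bucket, key_prefix


# Both spellings resolve to the same bucket and key prefix.
print(split_s3_path("s3://my-bucket/raw/pdfs"))
print(split_s3_path("my-bucket/raw/pdfs"))
```
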
When running scraper.document_corpus_to_pandas_df(), the JSON file is deserialised into a Python dictionary, and the corpus text is cleaned by replacing each key in this dictionary with its value. In order to modify and customize the content of this JSON file, run the script pdf2emb_nlp/ and adapt it as necessary.
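
The key-to-value replacement described above could look roughly like this. It is a sketch under assumptions: the entries shown are hypothetical (the real words_to_replace.json may differ), the helper name `replace_words` is invented, and the package may use a different replacement strategy than plain `str.replace`.

```python
import json


def replace_words(text, words_to_replace):
    """Replace every occurrence of each key in words_to_replace with its value."""
    for old, new in words_to_replace.items():
        text = text.replace(old, new)
    return text


# Hypothetical JSON content; entries are applied in the order they appear.
words_json = '{"\\n": " ", "e.g.": "eg", "  ": " "}'
words_to_replace = json.loads(words_json)

cleaned = replace_words("Use a phone,\ne.g. a  mobile.", words_to_replace)
print(cleaned)
```

Because the replacements run one after another, the order of the entries matters: an early replacement can create or remove text that a later one would match.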

Once you have created a file corpus_by_sentence.csv, you can embed the sentences in this file using your model of choice out of Word2Vec (with the option to include Tf-Idf weights), ELMo and BERT. For each model, sentence-level embeddings are generated. Where the original model would generate word-level embeddings, sentence-level embeddings are created by averaging all the word embeddings of the respective sentence. The embedding script in the scripts folder is an example of how you could run all 4 model variants (Word2Vec with and without Tf-Idf weights, ELMo and BERT) and save them separately. It imports the embedder module.

import os
import json
import yaml
import logging.config
import pandas as pd
from pdf2emb_nlp.embedder import Embedder

models_to_be_run = [
    'Word2Vec_tfidf_weighted',  # comment out as needed
    'Word2Vec',
    'ELMo',
    'BERT',
]

if __name__ == '__main__':
    DATA_DIR = os.getenv('DATA_DIR')
    MODELS_DIR = os.getenv('MODELS_DIR')
    CONFIG_DIR = os.getenv('CONFIG_DIR')

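    # Assumption: the path to logging.yaml is provided by a LOGGING_CONFIG environment variable (see .envrc).
    LOGGING_CONFIG = os.getenv('LOGGING_CONFIG')
    with open(LOGGING_CONFIG, 'r') as f:
        config = yaml.safe_load(f)
    logging.config.dictConfig(config)  # apply the logging configuration that was just loaded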

    with open(os.path.join(CONFIG_DIR, 'filenames.json'), 'r') as f:
        file_names = json.load(f)

    corpus_filename = "corpus_by_sentence.csv"
    corpus_by_sentence = pd.read_csv(os.path.join(DATA_DIR, "processed", corpus_filename))
    list_of_sentences = corpus_by_sentence['sentence'].values.tolist()
    print("Instantiating Embedder class.")
    embedder = Embedder(list_of_sentences)

    for model in models_to_be_run:
        print(f"Calculating {model} embeddings.")
        if model == 'Word2Vec_tfidf_weighted':
            sentence_embeddings, model_obj, tfidf_vectorizer = embedder.compute_word2vec_embeddings(tfidf_weights=True)
            embedder.save_model(tfidf_vectorizer, MODELS_DIR, file_names[model]['vectorizer_filename'])
            # the line above is specific to Word2Vec with TfIdf vectorizer and cannot be generalized to other models
        elif model == 'Word2Vec':
            sentence_embeddings, model_obj, _ = embedder.compute_word2vec_embeddings(tfidf_weights=False)
        elif model == 'BERT':
            bert_model = 'bert-base-nli-stsb-mean-tokens'  # This line is specific to BERT
            sentence_embeddings, model_obj = embedder.compute_bert_embeddings(bert_model)
        elif model == 'ELMo':
            sentence_embeddings, model_obj = embedder.compute_elmo_embeddings()
        else:
            raise KeyError(f'The model {model} is not recognized as input.')
        print(f"{model} embeddings calculated. Saving model.")
        embedder.save_embeddings(sentence_embeddings, MODELS_DIR, file_names[model]['embeddings_filename'])
        embedder.save_model(model_obj, MODELS_DIR, file_names[model]['model_filename'])
        print(f"{model} model saved. Saving .parquet file.")
        df = embedder.add_embeddings_to_corpus_df(
            os.path.join(DATA_DIR, "processed", corpus_filename), sentence_embeddings, file_names[model]['column_name']
        )
        embedder.df_to_parquet(df, os.path.join(DATA_DIR, "processed", file_names[model]['parquet_filename']))
        print(f"Parquet file saved. All steps done for the {model} model.")
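
Averaging word embeddings into a sentence embedding, as described before the script, can be illustrated as follows. This is a simplified sketch, not the Embedder's code: the function name `sentence_embedding`, the toy vectors and the weights are invented, and the way the package normalises Tf-Idf weights is an assumption here (a weighted mean).

```python
def sentence_embedding(tokens, word_vectors, weights=None):
    """Average the vectors of known tokens; optionally weight each token."""
    known = [t for t in tokens if t in word_vectors]
    if not known:
        return None  # no known words, so no embedding in this sketch
    total = [0.0] * len(word_vectors[known[0]])
    weight_sum = 0.0
    for token in known:
        w = 1.0 if weights is None else weights.get(token, 0.0)
        weight_sum += w
        for i, x in enumerate(word_vectors[token]):
            total[i] += w * x
    if weight_sum == 0.0:
        return None
    return [x / weight_sum for x in total]


# Toy 2-dimensional word vectors, invented for illustration only.
word_vectors = {"cell": [1.0, 0.0], "phone": [0.0, 1.0]}

plain = sentence_embedding(["cell", "phone", "unknown"], word_vectors)
weighted = sentence_embedding(
    ["cell", "phone"], word_vectors, weights={"cell": 3.0, "phone": 1.0}
)
print(plain, weighted)
```

With weights, words that carry more information about the sentence pull the average towards their own vector, which is the motivation for the Tf-Idf option.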

Each model has been saved as a .pickle file in the models folder, each model's embeddings as a .npy file in the models folder, and each pd.DataFrame as a .parquet file in the data/processed folder. Each .parquet file contains the same data as the corpus_by_sentence.csv file previously saved, with an added column, representing the sentence embeddings for the chosen model. A separate .parquet has been saved for each model, although the user may modify the script above to save all models' embeddings in the same .parquet file. The file names of the .pickle, .npy and .parquet files are stored in the filenames.json in the config folder. In order to modify and customize these names, run the script pdf2emb_nlp/ and adapt it as necessary.
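
For orientation, the scripts above look up five keys per model in filenames.json. The sketch below shows a structure that would satisfy those lookups and a small check for missing keys. The key names are taken from the scripts; every file name and column name is hypothetical and may differ from the real configuration.

```python
import json

# Keys mirror the lookups in the scripts; all values are hypothetical.
file_names = {
    "Word2Vec_tfidf_weighted": {
        "embeddings_filename": "word2vec_tfidf_embeddings.npy",
        "model_filename": "word2vec_tfidf.pickle",
        "vectorizer_filename": "tfidf_vectorizer.pickle",  # only used by this model
        "column_name": "Word2Vec_tfidf_weighted",
        "parquet_filename": "corpus_word2vec_tfidf.parquet",
    },
    "BERT": {
        "embeddings_filename": "bert_embeddings.npy",
        "model_filename": "bert.pickle",
        "column_name": "BERT",
        "parquet_filename": "corpus_bert.parquet",
    },
}

# Keys that every model entry needs for the embedding and query scripts.
required = {"embeddings_filename", "model_filename", "column_name", "parquet_filename"}
missing = {
    model: sorted(required - set(entry))
    for model, entry in file_names.items()
    if required - set(entry)
}

serialized = json.dumps(file_names, indent=4)
print(missing or "All required keys are present.")
```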

Finally, in order to search through your corpus of PDF files given a user search query (which can be a single word or a few words), run the script in the scripts folder, which imports the process_user_queries module:

import os
import yaml
import json
import logging.config
from pdf2emb_nlp.process_user_queries import query_embeddings

if __name__ == '__main__':
    user_search_input = 'cell phone'
    model_name = 'BERT'  # change as appropriate
    DATA_DIR = os.getenv("DATA_DIR")
    CONFIG_DIR = os.getenv('CONFIG_DIR')
    MODELS_DIR = os.getenv("MODELS_DIR")
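    # Assumption: the path to logging.yaml is provided by a LOGGING_CONFIG environment variable (see .envrc).
    LOGGING_CONFIG = os.getenv('LOGGING_CONFIG')
    with open(LOGGING_CONFIG, 'r') as f:
        config = yaml.safe_load(f)
    logging.config.dictConfig(config)  # apply the logging configuration that was just loaded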
    with open(os.path.join(CONFIG_DIR, 'filenames.json'), 'r') as f:
        file_names = json.load(f)

    tfidf_vectorizer = os.path.join(MODELS_DIR, "tfidf_vectorizer.pickle")

    model = os.path.join(MODELS_DIR, file_names[model_name]["model_filename"])  # this is optional for ELMo and BERT.
    trained_df_path = os.path.join(DATA_DIR, 'processed', file_names[model_name]["parquet_filename"])
    user_input_embedding, trained_df = query_embeddings(
        user_search_input, trained_df_path, file_names[model_name]["column_name"], model_name, model,
        distance_metric='cosine', tfidf_vectorizer=tfidf_vectorizer
    )
    # tfidf_vectorizer is not used (and optional) when model is not 'Word2Vec_tfidf_weighted'
    if user_input_embedding.size and not trained_df.empty:  # they must not be empty
        print(trained_df.sort_values('metric_distance', ascending=True)[['sentence', 'metric_distance']])

At this point, the user_input_embedding is the embedding of the user search query, and trained_df is the pd.DataFrame containing a column with the metric distance between the user embedding and each individual sentence embedding in the corpus (default metric: cosine similarity). If you want to visualise the most similar sentences to the user search query, you can simply sort the pd.DataFrame by its metric_distance column.

print(trained_df.sort_values('metric_distance', ascending=True)[['sentence', 'metric_distance']])

Project description adapted from the cookiecutter data science project template. #cookiecutterdatascience

