Library with NLP Algorithms implemented from scratch

Project description

Scratch NLP 🧠

Library with foundational NLP Algorithms implemented from scratch using PyTorch.

Documentation 📝

Documentation

Installation ⬇️

Install using pip

   pip install ScratchNLP

Install Manually for development

Clone the repo

  gh repo clone shanmukh05/scratch_nlp

Install dependencies

  pip install -r requirements.txt

Features 🛠️

  • Algorithms

    • Bag of Words
    • Ngram
    • TF-IDF
    • Hidden Markov Model
    • Word2Vec
    • GloVe
    • RNN (Many to One)
    • LSTM (One to Many)
    • GRU (Many to Many Synced)
    • Seq2Seq + Attention (Many to Many)
    • Transformer
    • BERT
    • GPT-2
  • Tokenization

    • BytePair Encoding
    • WordPiece Tokenizer
  • Metrics

    • BLEU
    • ROUGE (-N, -L, -S)
    • Perplexity
    • METEOR
    • CIDEr
  • Datasets

    • IMDB Reviews Dataset
    • Flickr Dataset
    • NLTK POS Datasets (treebank, brown, conll2000)
    • SQuAD QA Dataset
    • Genius Lyrics Dataset
    • LAMBADA Dataset
    • Wiki en Dataset
    • English to Telugu Translation Dataset
  • Tasks

    • Sentiment Classification
    • POS Tagging
    • Image Captioning
    • Machine Translation
    • Question Answering
    • Text Generation
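As a rough illustration of the first text-representation algorithms in the list above, here is a minimal pure-Python sketch of Bag of Words and TF-IDF. It is not the library's implementation: the function names `bag_of_words` and `tf_idf` are hypothetical, tokenization is a plain whitespace split, and the TF-IDF variant assumed here is term frequency (count divided by document length) times unsmoothed `log(N / df)`.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocab, count vectors) for a list of documents split on whitespace."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in tokenized:
        vec = [0] * len(vocab)
        for tok, count in Counter(doc).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def tf_idf(vectors):
    """Weight raw counts by term frequency times log inverse document frequency."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in vectors if vec[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df_j) for df_j in df]
    weighted = []
    for vec in vectors:
        total = sum(vec)  # document length; assumes no empty documents
        weighted.append([(c / total) * idf[j] for j, c in enumerate(vec)])
    return weighted


docs = ["the movie was great", "the movie was boring"]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
```

Words that appear in every document ("the", "movie", "was") get an IDF of zero under this variant, so only "great" and "boring" carry weight in the two example reviews.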

Implementation Details

Each algorithm is listed with its task, tokenization, output artifacts, and dataset.

BOW
  • Task: Text Representation
  • Tokenization: Preprocessed words
  • Output
    • Text Label, Vector npy files
    • Top K Vocab Frequency Histogram png
    • Vocab frequency csv
    • Wordcloud png
  • Dataset: IMDB Reviews

Ngram
  • Task: Text Representation
  • Tokenization: Preprocessed words
  • Output
    • Text Label, Vector npy files
    • Top K Vocab Frequency Histogram png
    • Top K ngrams Piechart png
    • Vocab frequency csv
    • Wordcloud png
  • Dataset: IMDB Reviews

TF-IDF
  • Task: Text Representation
  • Tokenization: Preprocessed words
  • Output
    • Text Label, Vector npy files
    • TF PCA Pairplot png
    • TF-IDF PCA Pairplot png
    • IDF csv
  • Dataset: IMDB Reviews

HMM
  • Task: POS Tagging
  • Tokenization: Preprocessed words
  • Output
    • Data Analysis png (sentence length, POS tag counts)
    • Emission Matrix TSNE html
    • Emission matrix csv
    • Test Predictions confusion matrix, classification report png
    • Transition Matrix csv, png
  • Dataset: NLTK Treebank

Word2Vec
  • Task: Text Representation
  • Tokenization: Preprocessed words
  • Output
    • Best Model pt
    • Training History json
    • Word Embeddings TSNE html
  • Dataset: IMDB Reviews

GloVe
  • Task: Text Representation
  • Tokenization: Preprocessed words
  • Output
    • Best Model pt
    • Training History json
    • Word Embeddings TSNE html
    • Top K Co-occurrence Matrix png
  • Dataset: IMDB Reviews

RNN
  • Task: Sentiment Classification
  • Tokenization: Preprocessed words
  • Output
    • Best Model pt
    • Training History json
    • Word Embeddings TSNE html
    • Confusion Matrix png
    • Training History png
  • Dataset: IMDB Reviews

LSTM
  • Task: Image Captioning
  • Tokenization: Preprocessed words
  • Output
    • Best Model pt
    • Training History json
    • Word Embeddings TSNE html
    • Training History png
  • Dataset: Flickr 8k

GRU
  • Task: POS Tagging
  • Tokenization: Preprocessed words
  • Output
    • Best Model pt
    • Training History json
    • Word Embeddings TSNE html
    • Confusion Matrix png
    • Test predictions csv
    • Training History png
  • Dataset: NLTK Treebank, Brown, CoNLL2000

Seq2Seq + Attention
  • Task: Machine Translation
  • Tokenization: Tokenization
  • Output
    • Best Model pt
    • Training History json
    • Source, Target Word Embeddings TSNE html
    • Test predictions csv
    • Training History png
  • Dataset: English to Telugu Translation

Transformer
  • Task: Lyrics Generation
  • Tokenization: BytePairEncoding
  • Output
    • Best Model pt
    • Training History json
    • Token Embeddings TSNE html
    • Test predictions csv
    • Training History png
  • Dataset: Genius Lyrics

BERT
  • Task: NSP Pretraining, QA Finetuning
  • Tokenization: WordPiece
  • Output
    • Best Model pt (pretrain, finetune)
    • Training History json (pretrain, finetune)
    • Token Embeddings TSNE html
    • Finetune Test predictions csv
    • Training History png (pretrain, finetune)
  • Dataset: Wiki en, SQuAD v1

GPT-2
  • Task: Sentence Completion
  • Tokenization: BytePairEncoding
  • Output
    • Best Model pt
    • Training History json
    • Token Embeddings TSNE html
    • Test predictions csv
    • Training History png
  • Dataset: LAMBADA
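
The BytePairEncoding tokenization listed for the Transformer and GPT-2 entries can be sketched roughly as follows. This is a simplified, character-level version of merge learning written for illustration only, not the code shipped in this library: the function name `learn_bpe` is hypothetical, there is no end-of-word marker, and ties between equally frequent pairs are broken by picking the lexicographically largest pair (an arbitrary choice made here so the result is deterministic).

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Represent each word as a tuple of symbols, keyed to its corpus frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

On this toy corpus the first two learned merges build up the shared prefix "low", which is the behaviour that lets BPE represent frequent substrings as single tokens.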

Examples 🌟

Run training and inference directly through import

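import yaml
from scratch_nlp.src.core.gpt import gpt

config_path = "<config_path>"  # path to the YAML config for the chosen algorithm

# Load the configuration into a dictionary
with open(config_path, "r") as stream:
    config_dict = yaml.safe_load(stream)

# Build the model from the config, then run training and inference
model = gpt.GPT(config_dict)
model.run()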

Run through CLI

  cd src
  python main.py --config_path '<config_path>' --algo '<algo name>' --log_folder '<output folder>'
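
For readers who want to see how the three flags in the command above might be consumed, here is a small `argparse` sketch. Only the flag names `--config_path`, `--algo`, and `--log_folder` come from the command; the helper name `build_parser`, the help strings, the choice to make every flag required, and the example values (`config.yaml`, `GPT`, `outputs`) are assumptions for illustration and may differ from the project's actual `main.py`.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the CLI command."""
    parser = argparse.ArgumentParser(description="Sketch of the CLI flags")
    parser.add_argument("--config_path", required=True, help="Path to the YAML config file")
    parser.add_argument("--algo", required=True, help="Name of the algorithm to run")
    parser.add_argument("--log_folder", required=True, help="Folder for logs and output artifacts")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
# The values below are placeholders, not names confirmed by the project.
args = build_parser().parse_args(
    ["--config_path", "config.yaml", "--algo", "GPT", "--log_folder", "outputs"]
)
```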

Contributing 🤝

Contributions are always welcome!

See CONTRIBUTING.md for ways to get started.

Acknowledgements 💡

I referred to many online resources while creating this project; they are all listed in RESOURCES.md. Thanks to everyone who created those blogs, code, and datasets 😊.

Thanks to the CS224N course, which motivated me to start this project.

About Me 👤

I am Shanmukha Sainath, working as an AI Engineer at KLA Corporation. I completed my Bachelor's degree in the Department of Electronics and Electrical Communication Engineering at IIT Kharagpur, with a Minor in Computer Science Engineering and a Micro in Artificial Intelligence and Applications.

Connect with me

@shanmukh05

Lessons Learned 📌

Most of the things in this project were new to me. Here are my major learnings from creating it:

  • NLP Algorithms
  • Research paper Implementation
  • Designing Project structure
  • Documentation
  • GitHub pages
  • PIP packaging

License ⚖️

MIT License

Feedback 📣

If you have any feedback, please reach out to me at venkatashanmukhasainathg@gmail.com

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scratchnlp-1.0.0.tar.gz (62.9 kB view details)

Uploaded Source

Built Distribution

ScratchNLP-1.0.0-py3-none-any.whl (90.4 kB view details)

Uploaded Python 3

File details

Details for the file scratchnlp-1.0.0.tar.gz.

File metadata

  • Download URL: scratchnlp-1.0.0.tar.gz
  • Upload date:
  • Size: 62.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for scratchnlp-1.0.0.tar.gz
Algorithm Hash digest
SHA256 f993cbe16c62e277c3bc12619c6465530eb72c3901c23f75d053668b99582c6c
MD5 74c4531a5d0536475e0fd5faa38c13de
BLAKE2b-256 dba46028271368a93b124ecb507e0c009dc08beba4ab1212d2d626a4cffbb286

See more details on using hashes here.

Provenance

The following attestation bundles were made for scratchnlp-1.0.0.tar.gz:

Publisher: python_package.yml on shanmukh05/scratch_nlp

Attestations:

File details

Details for the file ScratchNLP-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: ScratchNLP-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 90.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for ScratchNLP-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 483ce8b05611202931059f0f4152b6650065d61d150c9f32ca7573ccdcc0022d
MD5 d6f811b0f66bbf6368dc84d8271245ca
BLAKE2b-256 57734b82d2613c12003664f9ba9faa420497b61f8bbd9a655857039b16080d94

Provenance

The following attestation bundles were made for ScratchNLP-1.0.0-py3-none-any.whl:

Publisher: python_package.yml on shanmukh05/scratch_nlp

Attestations:
