patent-parsing-tools is a library for generating training and test sets from Google's USPTO data, useful for testing machine learning algorithms.
patent-parsing-tools
USPTO patents dataset generator.
Documentation
System requirements
sudo yum install python-devel libxslt-devel libxml2-devel
Installation:
pip install patent-parsing-tools
Examples:
Downloading dataset:
python -m patent_parsing_tools.downloader \
--directory dataset \
--year-from 2010 \
--year-to 2010
Collecting and serializing data:
python -m patent_parsing_tools.supervisor \
--working-directory patents/working_directory \
--train-destination patents/train_destination \
--test-destination patents/test_destination \
--year-from 2014 \
--year-to 2015
Generating a dictionary from the train set:
python -m patent_parsing_tools.bow.dictionary_maker \
--train-directory patents/train_destination \
--max-patents 1000000000 \
--dictionary dictionary.txt \
--dict-max-size 4096
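To illustrate what a size-capped dictionary is for, here is a minimal sketch in plain Python. It is not the library's implementation: the function names, the tokenizer, and the sample sentences are all assumptions made for this example. It keeps only the most frequent words (the idea behind a limit such as `--dict-max-size`) and then counts those words in a new text to form a bag-of-words vector.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(documents, max_size):
    """Map the max_size most frequent words to consecutive indices."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    # most_common returns (word, count) pairs in descending order of count
    top_words = counts.most_common(max_size)
    return {word: index for index, (word, _) in enumerate(top_words)}


def bag_of_words(text, dictionary):
    """Count dictionary words in text; words outside the dictionary are ignored."""
    vector = [0] * len(dictionary)
    for token in tokenize(text):
        index = dictionary.get(token)
        if index is not None:
            vector[index] += 1
    return vector


# Hypothetical sample data, not taken from the USPTO dataset
docs = ["a method for parsing patent text", "a patent for a parsing method"]
dictionary = build_dictionary(docs, max_size=4)
vector = bag_of_words("parsing a patent", dictionary)
```

Capping the dictionary fixes the length of every vector, so a smaller cap trades vocabulary coverage for smaller feature vectors.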
Generating bag of words for the train set and test set:
python -m patent_parsing_tools.bow.bag_of_words \
--serialized-patents patents/train_destination \
--destination-directory patents/final_dataset_train \
--dictionary dictionary.txt \
--batch-size 1048576
python -m patent_parsing_tools.bow.bag_of_words \
--serialized-patents patents/test_destination \
--destination-directory patents/final_dataset_test \
--dictionary dictionary.txt \
--batch-size 1048576
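The steps from the supervisor onward can also be chained from a single script. The sketch below only assembles the command lines shown above and runs them in order with `subprocess`. The helper names `build_commands` and `run_pipeline` are hypothetical and not part of patent-parsing-tools, and the download step is left out because it uses a different directory and year range in the example above.

```python
import subprocess
import sys


def build_commands(patents_dir="patents", dictionary="dictionary.txt"):
    """Return the argument lists for the steps shown above, in order."""
    python_module = [sys.executable, "-m"]
    train = f"{patents_dir}/train_destination"
    test = f"{patents_dir}/test_destination"
    commands = [
        # Collect and serialize the data
        python_module + [
            "patent_parsing_tools.supervisor",
            "--working-directory", f"{patents_dir}/working_directory",
            "--train-destination", train,
            "--test-destination", test,
            "--year-from", "2014",
            "--year-to", "2015",
        ],
        # Build the dictionary from the train set
        python_module + [
            "patent_parsing_tools.bow.dictionary_maker",
            "--train-directory", train,
            "--max-patents", "1000000000",
            "--dictionary", dictionary,
            "--dict-max-size", "4096",
        ],
    ]
    # Generate bag of words for the train set, then the test set
    for source, target in ((train, "final_dataset_train"), (test, "final_dataset_test")):
        commands.append(python_module + [
            "patent_parsing_tools.bow.bag_of_words",
            "--serialized-patents", source,
            "--destination-directory", f"{patents_dir}/{target}",
            "--dictionary", dictionary,
            "--batch-size", "1048576",
        ])
    return commands


def run_pipeline(commands):
    """Run each command in turn; check=True stops at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_pipeline(build_commands())` would execute the four commands in sequence. Keeping command construction separate from execution makes it possible to print and inspect the commands before running anything.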
Testing
pytest
Contributing and development
$ mkvirtualenv ppt
$ workon ppt
(ppt) $ pip install -r requirements.txt
Publish new release
$ git tag v1.0
$ git push origin v1.0
Building documentation
(ppt) $ sphinx-build -M html docs docs_build
References
Usage:
- Elton, Using natural language processing techniques to extract information on the properties and functionalities of energetic materials from large text corpora, 2019, online: https://arxiv.org/abs/1903.00415.
- Lee, Natural Language Processing Techniques for Advancing Materials Discovery: A Short Review, 2023, online: https://doi.org/10.1007/s40684-023-00523-6.
License
The MIT License (MIT). Copyright (c) 2014 Michał Dul, Piotr Przetacznik, Krzysztof Strojny. Check LICENSE files for more information.