Package for creating document embeddings of patents and analysis tools.

Patent breakthrough

The code in this repository is used to identify breakthrough innovations in historical patents from the USPTO. The docembedder Python package contains a variety of methods for creating document embeddings. We have optimized and tested these methods for their ability to predict similarity between patents. This was done by maximizing the cosine similarity between patents that are classified into the same technology class, and minimizing the cosine similarity between patents that fall into different technology classes. The methods, with their optimized parameters, are then used to create document embeddings. From these embeddings, novelty scores are computed using cosine similarities between the focal patent and the patents in the previous n years and the subsequent n years.

Getting Started

Clone this repository to your workstation to obtain the example notebooks and Python scripts:

git clone https://github.com/UtrechtUniversity/patent-breakthrough.git

Prerequisites

To install and run this project you need to have the following prerequisites installed.

- Python [>=3.8, <3.11]
- jupyterlab (or any other program to run jupyter notebooks)

To install jupyterlab:

pip install jupyterlab

Installation

To run the project, install the package together with its dependencies:

pip install git+https://github.com/UtrechtUniversity/patent-breakthrough.git

Built with

These packages are automatically installed in the step above:

Usage

1. Preparation

First, make sure that the data is prepared. There should be a directory with *.xz files, each named after a year: 1923.xz, 1924.xz, 1925.xz, etc. If you only have the raw .txt files, you have to compress your data first:

from docembedder.preprocessor.parser import compress_raw
compress_raw(some_file_name, "year.csv", some_output_dir)

Here, "year.csv" should be a file that contains the patent ids and the year in which each patent was issued.
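
Before starting a long run, it can help to check that the data directory really contains one compressed file per year. The following is a minimal sketch using only the Python standard library. The helper names find_year_files and missing_years are hypothetical and are not part of docembedder; the sketch assumes only the file naming convention described above.

```python
import re
from pathlib import Path


def find_year_files(patent_dir):
    """Map year -> path for every file named like '1923.xz' in patent_dir."""
    year_files = {}
    for path in sorted(Path(patent_dir).glob("*.xz")):
        # Keep only files whose stem is exactly a four-digit year.
        if re.fullmatch(r"\d{4}", path.stem):
            year_files[int(path.stem)] = path
    return year_files


def missing_years(year_files, year_start, year_end):
    """List the years in [year_start, year_end] that have no .xz file."""
    return [y for y in range(year_start, year_end + 1) if y not in year_files]
```

For example, missing_years(find_year_files(some_output_dir), 1838, 1951) lists the years that still need to be compressed for the full period.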

2. Hyperparameter optimization

There are procedures to optimize the preprocessor and ML models with respect to predicting CPC classifications. This step is not required to compute the novelties and impacts, and it has already been done for the patents from 1838 to 1951. For more information on how to optimize the models, see the documentation.
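
To illustrate the idea behind the optimization, the sketch below scores a set of embeddings by how much more similar patents of the same class are than patents of different classes. This is a toy illustration, not the objective used by docembedder: the function names are hypothetical, each patent is assumed to have exactly one class label, and the score (mean same-class similarity minus mean different-class similarity) is only one possible choice.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def class_separation(embeddings, classes):
    """Mean same-class similarity minus mean different-class similarity."""
    same, different = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        if classes[i] == classes[j]:
            same.append(sim)
        else:
            different.append(sim)
    if not same or not different:
        raise ValueError("need at least one same-class and one different-class pair")
    return sum(same) / len(same) - sum(different) / len(different)
```

A higher score means that the embedding separates the classes better, so a model or preprocessor setting could be compared against another by this kind of number.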

3. Preprocessing

To improve the quality of the patent texts, for example by processing or removing the start sections, the raw files need to be preprocessed. This is done using the Preprocessor and OldPreprocessor classes, for example:

from docembedder.preprocessor import Preprocessor, OldPreprocessor

prep = Preprocessor()
old_prep = OldPreprocessor()
documents = prep.preprocess_file("1928.xz")

Normally, however, preprocessing does not need to be a separate step. The embeddings can be computed directly, as explained in section 5.

4. Embedding models

There are 5 different embedding models implemented to compute the embeddings:

from docembedder.models import CountVecEmbedder, D2VEmbedder, BPembEmbedder
from docembedder.models import TfidfEmbedder, BERTEmbedder
model = BERTEmbedder()
model.fit(documents)
embeddings = model.transform(documents)

These models can have different parameters for training; see the section on hyperparameter optimization. The resulting embeddings can be either sparse or dense matrices. The functions and methods in this package handle both in the same way.

5. Computing embeddings

The prepared data can be analysed to compute the embeddings for each of the patents using the run_models function. This function can run in parallel, for example when your CPU has more than one core.

Before we can run, we have to tell docembedder the parameters of the run, which is done through the SimulationSpecification class:

from docembedder.utils import SimulationSpecification
sim_spec = SimulationSpecification(
    year_start=1838,  # Starting year for computing the embeddings.
    year_end=1951,  # Last year for computing the embeddings.
    window_size=21,  # Size of the window to compute the embeddings for.
    window_shift=1,  # How many years between subsequent windows.
    debug_max_patents=100  # For a trial run we sample the patents instead, remove for final run.
)
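
To make the window parameters concrete, the sketch below enumerates sliding windows for a given specification. It is an assumption about how the parameters combine, not the package's own implementation: the helper year_windows is hypothetical, windows are taken to include both their first and last year, and the actual alignment used by SimulationSpecification may differ.

```python
def year_windows(year_start, year_end, window_size, window_shift):
    """Return (first_year, last_year) pairs for windows that fit in the period."""
    if window_size < 1 or window_shift < 1:
        raise ValueError("window_size and window_shift must be positive")
    windows = []
    first = year_start
    # A window of window_size years starting at `first` ends at first + window_size - 1.
    while first + window_size - 1 <= year_end:
        windows.append((first, first + window_size - 1))
        first += window_shift
    return windows


windows = year_windows(1838, 1951, window_size=21, window_shift=1)
print(len(windows), windows[0], windows[-1])
```

Under these assumptions, the parameters above give overlapping 21-year windows, one of which spans 1920 to 1940, the window used in section 6.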

An example to create a file with the embeddings is:

from docembedder.utils import run_models
run_models({"bert": BERTEmbedder()}, model, sim_spec, output_fp, cpc_fp)

The output file is an HDF5 file, which stores the embeddings for all patents in all windows.

6. Computing novelty and impact

To compute the novelty and impact, we use the DocAnalysis class:

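from docembedder.analysis import DocAnalysis
from docembedder.datamodel import DataModel

# Open the HDF5 file that was written by run_models.
with DataModel(output_fp, read_only=False) as data:
    analysis = DocAnalysis(data)
    # Novelty and impact for the "bert" embeddings in the 1920-1940 window.
    results = analysis.compute_impact_novelty("1920-1940", "bert")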

The result is a dictionary that contains the novelties and impacts for each of the patents in that window (in this case 1920-1940).
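
As described in the introduction, these scores are derived from cosine similarities between a focal patent and the patents in the previous n years and the subsequent n years. The sketch below shows only that ingredient, using the standard library. It is not the DocAnalysis implementation: the function names are hypothetical, patents from the same year as the focal patent are left out by assumption, and how the package combines the two averages into novelty and impact is not shown here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def backward_forward_similarity(focal_year, focal_vec, patents, n_years):
    """Mean similarity to patents in the n years before and after the focal year.

    `patents` is an iterable of (year, vector) pairs. Each returned mean is
    None when no patent falls inside the corresponding range.
    """
    backward, forward = [], []
    for year, vec in patents:
        if focal_year - n_years <= year < focal_year:
            backward.append(cosine_similarity(focal_vec, vec))
        elif focal_year < year <= focal_year + n_years:
            forward.append(cosine_similarity(focal_vec, vec))

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(backward), mean(forward)
```

Intuitively, a patent that resembles little of what came before it but much of what came after it is a candidate breakthrough.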

About the Project

Date: February 2023

Researcher(s):

Research Software Engineer(s):

  • Raoul Schram
  • Shiva Nadi
  • Maarten Schermer
  • Casper Kaandorp
  • Jelle Treep (h.j.treep@uu.nl)

License

The code in this project is released under the MIT license.

Attribution and academic use

Manuscript in preparation

Contributing

Contributions are what make the open source community an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

To contribute:

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Contact

Benjamin Cornejo Costas - b.j.cornejocostas@uu.nl

Project Link: https://github.com/UtrechtUniversity/patent-breakthrough
