Create parsers

parserator

A toolkit for making domain-specific probabilistic parsers

Do you have domain-specific text data that would be much more useful if you could derive structure from the strings? This toolkit will help you create a custom NLP model that learns from patterns in real data and then uses that knowledge to process new strings automatically. All you need is some training data to teach your parser about its domain.

What does a probabilistic parser do?

Given a string, a probabilistic parser will break it out into labeled components. The parser uses conditional random fields to label components based on (1) features of the component string and (2) the order of labels.
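To make the two sources of evidence concrete, here is a toy sketch in plain Python. It is not parserator's implementation and not a trained conditional random field: the label set, the `emission_score` function, the `TRANSITION` table, and every weight are hypothetical values picked by hand. It only illustrates how a score based on a token's own features can be combined with a score based on label order to choose the best label sequence.

```python
# Toy illustration only: hand-picked scores, not a trained CRF
# and not parserator's API. All names and weights are hypothetical.
LABELS = ["AddressNumber", "StreetName", "StreetType"]


def emission_score(token, label):
    """Score how well a token's own features fit a label."""
    if label == "AddressNumber":
        return 2.0 if token.isdigit() else -1.0
    if label == "StreetType":
        is_street_word = token.lower().rstrip(".") in {"st", "ave", "rd"}
        return 2.0 if is_street_word else -1.0
    return 0.5  # StreetName: weak default score


# Score for moving from one label to the next; unlisted pairs get -1.0.
TRANSITION = {
    ("AddressNumber", "StreetName"): 1.0,
    ("StreetName", "StreetName"): 0.5,
    ("StreetName", "StreetType"): 1.0,
}


def best_labels(tokens):
    """Viterbi search for the highest-scoring label sequence."""
    # best[label] = (score, path) of the best sequence ending in label
    best = {lab: (emission_score(tokens[0], lab), [lab]) for lab in LABELS}
    for token in tokens[1:]:
        new_best = {}
        for lab in LABELS:
            candidates = [
                (score + TRANSITION.get((prev, lab), -1.0), path)
                for prev, (score, path) in best.items()
            ]
            score, path = max(candidates, key=lambda c: c[0])
            new_best[lab] = (score + emission_score(token, lab), path + [lab])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]


print(best_labels(["123", "Main", "St"]))
# ['AddressNumber', 'StreetName', 'StreetType']
```

In a real parser the weights are learned from labeled training data rather than written by hand.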

When is a probabilistic parser useful?

A probabilistic parser is particularly useful for sets of strings that may have common structure/patterns, but which deviate from those patterns in ways that are difficult to anticipate with hard-coded rules.

For example, in most cases, addresses in the United States start with a street number. But there are exceptions: some valid U.S. addresses start with a building name or a P.O. box instead. Furthermore, addresses in real data sets often include typos and other errors. Because the possible patterns and typos are too numerous to anticipate with hard-coded rules, a probabilistic parser is well-suited to parsing U.S. addresses.

With a probabilistic approach (as opposed to a rule-based approach), the parser can continually learn from new training data and thus continually improve its performance!

Some other examples of domains where a probabilistic parser can be useful:

  • addresses in other countries with unfamiliar conventions
  • product names/descriptions (e.g., parsing phrases like "Twizzlers Twists, Strawberry, 16-Ounce Bags (Pack of 6)" into brand, item, flavor, weight, etc.)
  • citations in academic writing

Examples of parserator

Try out these parsers on our web interface!

How to make a parser - quick overview

For more details on each step, see the parserator documentation.

  1. Initialize a new parser

    pip install parserator
    parserator init [YOUR PARSER NAME]
    python setup.py develop
    
  2. Configure the parser to your domain

    • configure labels (i.e., the set of possible tags for the tokens)
    • configure the tokenizer (i.e., how a raw string will be split into a sequence of tokens to be tagged)
  3. Define features relevant to your domain

    • define token-level features (e.g., length, casing)
    • define sequence-level features (e.g., whether a token is the first token in the sequence)
  4. Prepare training data

    • Parserator reads training data in XML format
    • To create XML training data from unlabeled strings in a CSV file, use parserator's command line interface to manually label tokens. The labeler reads values from the first column and ignores all other columns. To start labeling, run parserator label [infile] [outfile] [modulename]
    • For example, parserator label unlabeled/rawstrings.csv labeled_xml/labeled.xml usaddress
  5. Train your parser

    • To train your parser on your labeled training data, run parserator train [traindata] [modulename]
    • For example, parserator train labeled_xml/labeled.xml usaddress or parserator train "labeled_xml/*.xml" usaddress
    • After training, your parser will have an updated model, in the form of a .crfsuite settings file
  6. Repeat steps 3-5 as needed!
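As a rough illustration of steps 2 and 3, the sketch below shows what a tokenizer and feature functions for a toy address domain might look like. The function names (`tokenize`, `token_features`, `sequence_features`), the label set, and the chosen features are all hypothetical and are not taken from the parserator template; see the parserator documentation for the actual hooks your parser module should define.

```python
import re

# Hypothetical label set for a toy address domain (step 2).
LABELS = ["AddressNumber", "StreetName", "StreetType"]


def tokenize(raw_string):
    """Split a raw string into tokens on whitespace (step 2)."""
    return re.findall(r"\S+", raw_string)


def token_features(token):
    """Token-level features such as length and casing (step 3)."""
    return {
        "length": len(token),
        "is_digit": token.isdigit(),
        "is_title_case": token.istitle(),
    }


def sequence_features(tokens):
    """Add sequence-level features, e.g. position in the sequence (step 3)."""
    feature_dicts = [token_features(token) for token in tokens]
    last_index = len(tokens) - 1
    for index, features in enumerate(feature_dicts):
        features["is_first"] = index == 0
        features["is_last"] = index == last_index
    return feature_dicts


tokens = tokenize("123 Main St")
for token, features in zip(tokens, sequence_features(tokens)):
    print(token, features)
```

Each token ends up with one dictionary of features, which is the kind of input a sequence model can learn weights for during training.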

How to use your new parser

Once you have trained a model from your training data, install your custom parser by running python setup.py develop.

Then, in a Python shell, you can import your parser and use the parse and tag methods to process new strings. For example, to use the probablepeople module:

>>> import probablepeople
>>> probablepeople.parse('Mr George "Gob" Bluth II')
[('Mr', 'PrefixMarital'), ('George', 'GivenName'), ('"Gob"', 'Nickname'), ('Bluth', 'Surname'), ('II', 'SuffixGenerational')]
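The parse output above is a list of (token, label) pairs. If a label-keyed view is more convenient, the pairs can be regrouped with a few lines of plain Python. The helper below, `group_by_label`, is a hypothetical name used for illustration; it is not part of probablepeople or parserator and is not a description of how the tag method works. The input list is copied from the example output above so the snippet runs without the library installed.

```python
# Example output copied from the parse call shown above.
parsed = [
    ('Mr', 'PrefixMarital'),
    ('George', 'GivenName'),
    ('"Gob"', 'Nickname'),
    ('Bluth', 'Surname'),
    ('II', 'SuffixGenerational'),
]


def group_by_label(pairs):
    """Join tokens that share a label, keeping first-seen label order."""
    grouped = {}
    for token, label in pairs:
        grouped.setdefault(label, []).append(token)
    return {label: " ".join(tokens) for label, tokens in grouped.items()}


components = group_by_label(parsed)
print(components["GivenName"], components["Surname"])
# George Bluth
```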

Important Links

Team

Errors and Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report an issue.

Patches and Pull Requests

We welcome your ideas! You can make suggestions in the form of GitHub issues (bug reports, feature requests, general questions), or you can submit a code contribution via a pull request.

How to contribute code:

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request with a description of your work! Don't worry if it isn't perfect: think of a PR as a start of a conversation rather than a finished product.

Copyright and Attribution

Copyright (c) 2016 DataMade. Released under the MIT License.
