A tool for automating the process of extracting relevant information from text documents

Project description

Parseidon

Parseidon is a document-parsing and text-extraction tool written in Python. It lets the user extract strings that match a predefined format, using either regular expressions (regex) or parsing expression grammars (PEGs) for pattern matching. In addition, Parseidon's filter mode uses vocabulary data to filter out common words, leaving uncommon strings that might be of interest. Pattern matching and filtering can also be combined in the find mode, where the filter helps the user identify words not covered by their regexes or PEGs.

Modes

Parseidon consists of four separate modes:

  • regex_mode performs pattern matching on the document strings using regular expressions.
    • A more detailed description can be found in regex_mode.
  • pegparse_mode has essentially the same functionality as regex_mode, except that it uses parsing expression grammar (PEG) rules to find matches.
  • filter_mode filters out common dictionary items, leaving the unrecognized, potentially interesting words for manual inspection by the user.
    • A more detailed description can be found in filter_mode.
  • find_mode combines the functionality of filter_mode with either regex_mode or pegparse_mode, highlighting both pattern matches and unrecognized strings.
    • A more detailed description can be found in find_mode.
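
The idea behind combining pattern matching with vocabulary filtering can be illustrated with a minimal sketch that uses only Python's standard re module. This is not Parseidon's implementation or API: the simplified pattern, the tiny vocabulary, and the function name find_candidates are assumptions made purely for illustration.

```python
import re

# Hypothetical pattern: a simplified IPv4-like matcher (octet ranges are not checked).
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical stand-in for vocabulary data; a real word list would be far larger.
COMMON_WORDS = {"the", "server", "at", "was", "reached", "by", "host"}


def find_candidates(text):
    """Return (pattern_matches, unrecognized_words) for a piece of text."""
    matches = IPV4_PATTERN.findall(text)
    # Blank out matched spans so they are not also reported as unrecognized words.
    remainder = IPV4_PATTERN.sub(" ", text)
    words = re.findall(r"[A-Za-z][A-Za-z0-9_-]*", remainder)
    # Keep only words that the vocabulary does not recognize.
    unrecognized = [w for w in words if w.lower() not in COMMON_WORDS]
    return matches, unrecognized


matches, unrecognized = find_candidates(
    "The server at 192.168.0.1 was reached by host xk7-relay"
)
print(matches)        # pattern matches
print(unrecognized)   # words not covered by the pattern or the vocabulary
```

In this sketch the pattern step plays the role of a regex-style match, and the vocabulary step surfaces the leftover string "xk7-relay", which no pattern covered.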

Plugins

In addition to the core, the project includes plugins. The implemented plugins are listed below.

  • parseidon-headings-plugin
    • Removes numbered headings that could be falsely identified as IPv4 addresses.
  • parseidon-hyphen-plugin
    • Determines whether a word containing a hyphen is correct or whether the hyphen exists only because the word exceeded the line width and was split across lines.

These are described in more detail in headings_plugin and hyphen_plugin.
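
To make the two problems concrete, here is a rough sketch of the kind of heuristics involved, written with Python's standard re module. It does not reflect how the plugins are actually implemented: the patterns, the function names drop_numbered_headings and join_line_break_hyphens, and the sample vocabulary are all assumptions for illustration.

```python
import re

# Simplified IPv4-like pattern; a heading number such as "1.2.3.4" also matches it.
IPV4_LIKE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Assumed heuristic: a dotted number at the start of a line,
# followed by whitespace and a capitalized title word.
NUMBERED_HEADING = re.compile(r"^\s*\d+(?:\.\d+)+\.?\s+[A-Z]")


def drop_numbered_headings(lines):
    """Remove lines that look like numbered headings."""
    return [line for line in lines if not NUMBERED_HEADING.match(line)]


def join_line_break_hyphens(text, vocabulary):
    """Resolve hyphens that sit directly before a line break.

    If the two halves form a known word, the hyphen is dropped;
    otherwise the hyphen is kept and only the line break is removed.
    """
    def resolve(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in vocabulary:
            return joined
        return match.group(1) + "-" + match.group(2)

    return re.sub(r"(\w+)-\n(\w+)", resolve, text)


lines = ["1.2.3.4 Network Setup", "Gateway is 10.0.0.1"]
print(IPV4_LIKE.findall("\n".join(lines)))                          # heading number is a false positive
print(IPV4_LIKE.findall("\n".join(drop_numbered_headings(lines))))  # only the real address remains
print(join_line_break_hyphens("the config-\nuration uses a well-\nknown port", {"configuration"}))
```

The heading heuristic here is deliberately crude (a line that starts with a real address followed by a capitalized word would also be dropped), which is one reason such handling benefits from being a separate, optional step.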

Documentation

In addition to this document, the project includes a documentation folder that contains information about installation, usage, plugins, and language resources.

Contact

For questions, feedback, or general inquiries, please contact us at parseidon@foi.se.

Data attribution

For attribution of language resources used in this project, please refer to third party notices. For information on how the respective sources are used, please see language resources.

Download files

Source Distribution

parseidon-2.3.3.tar.gz (2.5 MB)

Uploaded Source

Built Distribution

parseidon-2.3.3-py3-none-any.whl (1.4 MB)

Uploaded Python 3

File details

Details for the file parseidon-2.3.3.tar.gz.

File metadata

  • Download URL: parseidon-2.3.3.tar.gz
  • Size: 2.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for parseidon-2.3.3.tar.gz
Algorithm Hash digest
SHA256 9123f5d354b4b3ffcc10de90b7683759d78501d21827656f30989b81d518a70d
MD5 c06a010f5bd613810eb25801fb4372f7
BLAKE2b-256 b3c2e47960be8650b3b2d0e871daa80fbfd8022e630e24a62c2677e2d832362a
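
A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using Python's standard hashlib module; the helper names sha256_of_file and matches_published_hash are hypothetical, and the file path passed in is assumed to point at a local copy of the archive.

```python
import hashlib

# SHA256 digest listed above for parseidon-2.3.3.tar.gz.
EXPECTED_SHA256 = "9123f5d354b4b3ffcc10de90b7683759d78501d21827656f30989b81d518a70d"


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_published_hash("parseidon-2.3.3.tar.gz") would return True only if a file with that name in the current directory has the digest shown above.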

Provenance

The following attestation bundles were made for parseidon-2.3.3.tar.gz:

Publisher: release.yml on CrateOrg/parseidon

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file parseidon-2.3.3-py3-none-any.whl.

File metadata

  • Download URL: parseidon-2.3.3-py3-none-any.whl
  • Size: 1.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for parseidon-2.3.3-py3-none-any.whl
Algorithm Hash digest
SHA256 0e34b0509ab5011e76d2b926a51b1921164bb2217fef1f102ceae16e33a41098
MD5 1cb2d36544ce25f6a6567ea27e061bc1
BLAKE2b-256 32e87ae0f1b6601dbd607f953b1f16cb49aede56b3dec2913041c41e324a6600

Provenance

The following attestation bundles were made for parseidon-2.3.3-py3-none-any.whl:

Publisher: release.yml on CrateOrg/parseidon
