A tool for automating the process of extracting relevant information from text documents
Parseidon
Parseidon is a document-parsing and text-extraction tool written in Python. Its purpose is to let the user extract strings that match a predefined format, using either regular expressions (regex) or parsing expression grammars (PEG) for pattern matching. In addition, Parseidon's filter mode uses vocabulary data to filter out common words, leaving uncommon strings that might be of interest. Pattern matching and filtering can also be used together in the find mode, where the filter helps the user identify words not covered by their regexes or PEGs.
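As a rough illustration of the pattern-matching idea, the sketch below uses only Python's standard `re` module. It is not Parseidon's API: the `PATTERNS` table, the `extract_matches` function, and both regexes are hypothetical and deliberately simplified.

```python
import re

# Hypothetical pattern table; the labels and regexes are illustrative only
# and are not taken from Parseidon.
PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def extract_matches(text):
    """Return (label, matched string) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), label, m.group()))
    found.sort()  # order by position in the text
    return [(label, s) for _, label, s in found]

sample = "Contact admin@example.org about host 192.168.0.1 today."
print(extract_matches(sample))
# [('email', 'admin@example.org'), ('ipv4', '192.168.0.1')]
```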
Modes
Parseidon consists of four separate modes:
- regex_mode performs pattern matching on the document strings using regular expressions.
  - A more detailed description can be found in regex_mode.
- pegparse_mode has essentially the same functionality as regex_mode, except that it uses parsing expression grammar (PEG) rules to find matches.
  - A more detailed description can be found in pegparse_mode.
- filter_mode filters out common dictionary items, leaving the unrecognized, potentially interesting words for manual inspection by the user.
  - A more detailed description can be found in filter_mode.
- find_mode combines the functionality of filter_mode with either regex_mode or pegparse_mode, highlighting both pattern matches and unrecognized strings.
  - A more detailed description can be found in find_mode.
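The way filtering and pattern matching complement each other can be sketched in plain Python. This is a minimal sketch under stated assumptions, not Parseidon's implementation: `COMMON_WORDS`, `filter_uncommon`, and `find` are hypothetical names, the vocabulary is a tiny stand-in for real language resources, and the tokenization is intentionally simple.

```python
import re

# Tiny stand-in vocabulary (assumption); real vocabulary data would be far larger.
COMMON_WORDS = {"the", "server", "at", "runs", "build", "is", "on"}
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def filter_uncommon(text, vocabulary=COMMON_WORDS):
    """Filter-style pass: keep tokens whose lowercase form is not in the vocabulary."""
    tokens = re.findall(r"[A-Za-z0-9_-]+", text)
    return [t for t in tokens if t.lower() not in vocabulary]

def find(text, pattern=IPV4, vocabulary=COMMON_WORDS):
    """Find-style pass: collect pattern matches, then filter the remaining text."""
    matches = pattern.findall(text)
    remainder = pattern.sub(" ", text)  # blank out matched spans before filtering
    return {"matches": matches, "unrecognized": filter_uncommon(remainder, vocabulary)}

sample = "The server at 10.0.0.7 runs build zx81-beta"
print(find(sample))
# {'matches': ['10.0.0.7'], 'unrecognized': ['zx81-beta']}
```

Here the pattern catches the address, while the filter surfaces `zx81-beta`, a string that no pattern was written for.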
Plugins
The project includes plugins in addition to the core project. Below follows a list of implemented plugins.
- parseidon-headings-plugin
  - Removes numbered headings that could be falsely identified as IPv4 addresses.
- parseidon-hyphen-plugin
  - Determines whether a hyphen in a word belongs to the word or exists only because the word was split at the end of a line.
These are described in more detail in headings_plugin and hyphen_plugin.
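The two problems these plugins address can be illustrated with a short sketch. The code below is not taken from either plugin: `strip_numbered_headings`, `resolve_hyphen`, the `HEADING` regex, and the vocabulary lookup are all assumptions made for illustration, and the heading heuristic is naive (it would also drop a body line that happens to start with an address followed by a word).

```python
import re

# Assumption: a numbered heading is a line that starts with a dotted section
# number (such as "1.2.3.4") followed by a title that begins with a letter.
HEADING = re.compile(r"^\s*\d+(?:\.\d+)+\.?\s+[A-Za-z]")

def strip_numbered_headings(lines):
    """Drop lines that look like numbered headings, so that section numbers
    such as '1.2.3.4' are not later reported as IPv4 addresses."""
    return [line for line in lines if not HEADING.match(line)]

def resolve_hyphen(first, second, vocabulary):
    """Rejoin a word that was split as 'first-' / 'second' across a line break."""
    joined = first + second
    if joined.lower() in vocabulary:
        return joined              # hyphen was only a line-wrap artifact
    return first + "-" + second    # otherwise keep the hyphen as part of the word

lines = ["1.2.3.4 Network setup", "The gateway is 10.0.0.1.", "2.1 Scope"]
print(strip_numbered_headings(lines))
# ['The gateway is 10.0.0.1.']

vocab = {"information", "well-known"}
print(resolve_hyphen("infor", "mation", vocab))  # information
print(resolve_hyphen("well", "known", vocab))    # well-known
```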
Documentation
In addition to this document, the project includes a documentation folder, which contains information about installation, usage, plugins, and language resources.
Contact
For questions, feedback, or general inquiries, please contact us at parseidon@foi.se.
Data attribution
For attribution of language resources used in this project, please refer to third party notices. For information on how the respective sources are used, please see language resources.