Parse Nexis Uni RTF files into a JSON Lines file.
Nexis Uni Parser
This package converts Nexis Uni rich text (RTF) files to JSON Lines format.
Features
- TODO
Requirements
- TODO
Installation
You can install Nexis Uni Parser via pip from PyPI:
pip install nexis-uni-parser
Usage
This package provides two main functions: convert_rtf_to_plain_text and parse.
Convert an RTF file to plain text
You can convert an RTF file to plain text directly with pandoc. That said, I have included a function that does the same thing, since it could be useful. Under the hood, it just calls pandoc.
from pathlib import Path
from nexis_uni_parser import convert_rtf_to_plain_text
inputfile = Path.home().joinpath("nexisuni-file.rtf")
output_filepath = convert_rtf_to_plain_text(inputfile)
print(output_filepath)
# Example output: /Users/name/nexisuni-file.txt
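Since the function above is described as a thin wrapper around pandoc, here is a rough sketch of what such a wrapper might look like. It is not the package's actual implementation: the names build_pandoc_command and rtf_to_plain_text are hypothetical, and it assumes a pandoc release with RTF input support is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(rtf_path: Path, txt_path: Path) -> list:
    """Build the pandoc command line for an RTF to plain text conversion."""
    return [
        "pandoc",
        str(rtf_path),
        "--from", "rtf",      # input format
        "--to", "plain",      # output format
        "--output", str(txt_path),
    ]


def rtf_to_plain_text(rtf_path: Path) -> Path:
    """Convert rtf_path with pandoc and return the path of the .txt file."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    # Hypothetical convention: write the output next to the input file.
    txt_path = rtf_path.with_suffix(".txt")
    subprocess.run(build_pandoc_command(rtf_path, txt_path), check=True)
    return txt_path
```

Calling rtf_to_plain_text(Path.home().joinpath("nexisuni-file.rtf")) would write nexisuni-file.txt next to the input, provided pandoc is available.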
Parse Nexis Uni Files
The parse function can be used to parse a single file or a directory. Both produce a gzipped JSON Lines file. I chose a compressed JSON Lines file because the text data can get large if all files are read into memory; JSON Lines can be written and read one record at a time.
from pathlib import Path
from nexis_uni_parser import parse
inputfile = Path.home().joinpath("nexisuni-file.rtf")
output_filepath = parse(inputfile)
# Reading the data into a pandas dataframe is easy from here.
import pandas as pd
nexisuni_df = pd.read_json(str(output_filepath), compression="gzip", lines=True)
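If the output is too large to load into a dataframe at once, the gzipped JSON Lines file can also be read one record at a time using only the standard library. The following is a minimal sketch; iter_jsonlines_gz is a hypothetical helper, not part of this package, and it assumes each non-empty line holds one JSON object.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_jsonlines_gz(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one decoded record per non-empty line of a gzipped JSON Lines file."""
    # Text mode ("rt") makes gzip decode bytes to str while decompressing.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Looping with for record in iter_jsonlines_gz(output_filepath) keeps only one record in memory at a time. Passing chunksize to pd.read_json together with lines=True is another option if you would rather stay in pandas.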
Contributing
Contributions are very welcome. To learn more, see the Contributor Guide.
License
Distributed under the terms of the MIT license, Nexis Uni Parser is free and open source software.
Issues
If you encounter any problems, please file an issue along with a detailed description.
Credits
This project was generated from @cjolowicz's Hypermodern Python Cookiecutter template.
Hashes for nexis_uni_parser-0.1.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 4d517eaae662e955675f39fc8dad87fac537481a43a236e81682a05f7d7e15ff
MD5 | 1c5b57bb59cdd160f0bda41485b72d5a
BLAKE2b-256 | bf0eb2a35cf5c9d19da3cf6818d6ee4a6a3b47287218b426591eede3641cdd46