Parse NexisUni rtf files into a jsonlines file.
NexisUni Parser
This module can be used to convert Nexis Uni rich-text files to a tabular format.
Usage
This package provides three main functions.
Convert an RTF file to plain text
Converting an RTF file to plain text can be done directly with pandoc. For convenience, this package also includes a function that performs the conversion; under the hood it simply calls pandoc.
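As a rough illustration of what a thin pandoc wrapper can look like, the sketch below shells out to pandoc from Python. The helper names `build_pandoc_command` and `rtf_to_plain_text` are hypothetical and are not this package's actual API. It also assumes that a pandoc version with RTF input support is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(src: Path, dst: Path) -> list[str]:
    """Build the pandoc invocation that converts an RTF file to plain text."""
    return ["pandoc", "-f", "rtf", "-t", "plain", str(src), "-o", str(dst)]


def rtf_to_plain_text(src: Path) -> Path:
    """Convert src to a .txt file next to it and return the new path.

    Hypothetical helper: assumes pandoc (with its RTF reader) is installed.
    """
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    dst = src.with_suffix(".txt")
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(build_pandoc_command(src, dst), check=True)
    return dst
```

Keeping the command construction in its own function makes the wrapper easy to inspect without actually running pandoc.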
Parse Nexis Uni Files
Parsing a Nexis Uni file produces a gzipped JSON lines file, which can be read easily with pandas. I chose a compressed JSON lines file because the text data can get rather large. Writing to Excel directly would add a dependency and would force all the data to be read into memory before the file is written. Streaming records directly into a JSON lines file keeps memory consumption relatively low.
from pathlib import Path
from nexisuni_parser import parse
inputfile = Path.home().joinpath("nexisuni-file.rtf")
output_filepath = parse(inputfile)
# Reading the data into a pandas dataframe is easy from here.
import pandas as pd
nexisuni_df = pd.read_json(str(output_filepath), compression="gzip", lines=True)
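To illustrate why streaming into a gzipped JSON lines file keeps memory use low, here is a minimal sketch that uses only the standard library. It is not the package's implementation: `write_jsonl_gz`, `read_jsonl_gz`, and the record fields in the usage comment are made up for illustration.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable, Iterator


def write_jsonl_gz(records: Iterable[dict], path: Path) -> Path:
    """Write one JSON object per line; only one record is serialized at a time."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_jsonl_gz(path: Path) -> Iterator[dict]:
    """Lazily yield records back, one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield json.loads(line)


# Hypothetical usage: the records can come from a generator, so the full
# data set never has to exist in memory at once.
# write_jsonl_gz(({"title": t} for t in ["A", "B"]), Path("sample.jsonl.gz"))
```

A file written this way can be loaded with the same `pd.read_json(..., compression="gzip", lines=True)` call shown above.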
Hashes for nexis_uni_parser-0.1.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 150cf9bdd912c694069a5efb9c193616eabaa21ea91fb82ea07987dab2a813c5
MD5 | d5484b859acc3196abdcbe875a0dbad4
BLAKE2b-256 | 5a8fcf0f5f47ec23823bb32e22b56baf7f1659ac30d07263112ba6c1ee81d2c0