
Parse Nexis Uni RTF files into a JSON lines file.


NexisUni Parser

This module can be used to convert Nexis Uni rich-text files to a tabular format.

Usage

There are three main functions that this package provides.

Convert an RTF file to plain text

Pandoc can convert an RTF file to plain text directly. For convenience, this package also includes a function that performs the same conversion; under the hood it simply calls pandoc.
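For readers who want to see what such a pandoc call could look like, here is a minimal sketch. It is not this package's actual implementation: the function names `build_pandoc_command` and `rtf_to_plain_text` are hypothetical, and the sketch assumes a pandoc version with an RTF reader is installed and available on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def build_pandoc_command(rtf_path: Path, txt_path: Path) -> List[str]:
    """Build a pandoc command line for an RTF to plain text conversion.

    Hypothetical helper for illustration; not part of nexisuni_parser.
    """
    return [
        "pandoc",
        "-f", "rtf",         # input format
        "-t", "plain",       # output format
        "-o", str(txt_path), # output file
        str(rtf_path),       # input file
    ]


def rtf_to_plain_text(rtf_path: Path) -> Path:
    """Convert an RTF file to a .txt file next to it by calling pandoc.

    Assumes the pandoc executable is on the PATH; raises
    subprocess.CalledProcessError if pandoc exits with an error.
    """
    txt_path = rtf_path.with_suffix(".txt")
    subprocess.run(build_pandoc_command(rtf_path, txt_path), check=True)
    return txt_path
```

Keeping the command construction separate from the subprocess call makes it possible to inspect the command without pandoc being installed.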

Parse Nexis Uni Files

Parsing a Nexis Uni file produces a gzipped JSON lines file, which can be read easily using pandas. I chose a compressed JSON lines file because the text data can get rather large. Writing to Excel directly would add a dependency and would force all the data to be read into memory before the file is written. Streaming records directly into a JSON lines file keeps memory consumption relatively low.

from pathlib import Path
from nexisuni_parser import parse

inputfile = Path.home().joinpath("nexisuni-file.rtf")

output_filepath = parse(inputfile)

# Reading the data into a pandas dataframe is easy from here.

import pandas as pd

nexisuni_df = pd.read_json(str(output_filepath), compression="gzip", lines=True)
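The streaming idea behind the gzipped JSON lines format can be illustrated with the standard library alone. The following is a rough sketch, not the package's own code: the helpers `write_jsonl_gz` and `read_jsonl_gz` and the record fields used in the usage note are hypothetical. Each record is written (or read) one line at a time, so only a single record needs to be held in memory.

```python
import gzip
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator


def write_jsonl_gz(records: Iterable[Dict], path: Path) -> int:
    """Stream records into a gzipped JSON lines file, one JSON object per line.

    Hypothetical helper for illustration. Returns the number of records written.
    """
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl_gz(path: Path) -> Iterator[Dict]:
    """Lazily yield one record at a time from a gzipped JSON lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield json.loads(line)
```

Because `write_jsonl_gz` accepts any iterable, a generator that yields parsed articles one by one could be passed in directly, and `read_jsonl_gz` can be used in a `for` loop when the file is too large to load into a dataframe at once.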

Download files


Source Distribution

nexis-uni-parser-0.1.2.tar.gz (4.4 kB)

Uploaded Source

Built Distribution

nexis_uni_parser-0.1.2-py3-none-any.whl (4.6 kB)

Uploaded Python 3
