
CNN/DailyMail Dataset as SQLite


Creates a SQLite database of the CNN and DailyMail summarization dataset.

Documentation

See the full documentation. The API reference is also available.

Obtaining

The easiest way to install the command line program is via the pip installer:

pip3 install zensols.cnndmdb

Binaries are also available on PyPI.

Usage

First, create the SQLite database file by running cnndmdb load, then check that the file data/cnn.sqlite3 was created. This takes a while because the entire corpus is downloaded before it is inserted into the SQLite file.
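
The check can also be done from Python. The following is a minimal sketch using only the standard library: the helper name list_tables is hypothetical and not part of this package, and no table names are assumed because the schema is not described here. It opens the file read-only and lists whatever tables it contains.

```python
import sqlite3
from pathlib import Path
from typing import List


def list_tables(db_path: Path) -> List[str]:
    """Return the names of the tables in the SQLite file at ``db_path``.

    Raises FileNotFoundError when the file is missing, which is the
    expected state before the load step has completed.
    """
    if not db_path.is_file():
        raise FileNotFoundError(f'no database file at {db_path}')
    # Open read-only through a file: URI so the check can never create
    # or modify the database.
    uri = f'{db_path.resolve().as_uri()}?mode=ro'
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == '__main__':
    # Path taken from the usage instructions above.
    try:
        print(list_tables(Path('data/cnn.sqlite3')))
    except FileNotFoundError as err:
        print(err)
```

An empty list would suggest the file exists but the load did not finish populating it.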

Command Line

The SQLite database keys can be listed:

cnndmdb keys

Then the command line can also be used to print articles:

cnndmdb show -t org 3b07f5102c69e3e609d73b2ccb0dc5549d4fbaf6

The -t org option tells the program to treat the given key as an original corpus key. The same option also allows selecting an article by SQLite rowid or as the Kth smallest article.
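
To make the difference between the key types concrete, here is a small sketch against an in-memory SQLite table. Everything in it is an assumption for illustration: the table name article, its columns, the sample rows, and the reading of "smallest" as shortest text are hypothetical and may not match how the package stores or orders articles.

```python
import sqlite3

# Hypothetical in-memory table; the real schema of data/cnn.sqlite3 is not
# documented here and may differ.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE article (org_key TEXT, text TEXT)')
conn.executemany(
    'INSERT INTO article (org_key, text) VALUES (?, ?)',
    [('aaa', 'a fairly long article body'),
     ('bbb', 'short'),
     ('ccc', 'medium length')])


def by_rowid(rowid: int) -> str:
    """Look up a row by SQLite's implicit integer rowid (1-based here)."""
    row = conn.execute(
        'SELECT org_key FROM article WHERE rowid = ?', (rowid,)).fetchone()
    return row[0]


def kth_smallest(k: int) -> str:
    """Return the key of the k-th shortest text (0-based), ties by rowid."""
    row = conn.execute(
        'SELECT org_key FROM article ORDER BY length(text), rowid '
        'LIMIT 1 OFFSET ?', (k,)).fetchone()
    return row[0]


print(by_rowid(2))      # second inserted row
print(kth_smallest(0))  # row with the shortest text
```

A rowid identifies a row by insertion position in the table, whereas an original corpus key is an identifier carried over from the source dataset.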

API

The corpus objects are accessible as mapped Python objects. For example:

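# import path is an assumption: names taken from the package's top-level module
from zensols.cnndmdb import ApplicationFactory, Corpus, Article
# print the text of the first article in the corpus stash
corpus: Corpus = ApplicationFactory.get_corpus()
art: Article = next(iter(corpus.stash.values()))
print(art.text)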

Data Source

The data is sourced from a TensorFlow dataset, which in turn uses the Abigail See GitHub repository.

@article{DBLP:journals/corr/SeeLM17,
  author    = {Abigail See and
               Peter J. Liu and
               Christopher D. Manning},
  title     = {Get To The Point: Summarization with Pointer-Generator Networks},
  journal   = {CoRR},
  volume    = {abs/1704.04368},
  year      = {2017},
  url       = {http://arxiv.org/abs/1704.04368},
  archivePrefix = {arXiv},
  eprint    = {1704.04368},
  timestamp = {Mon, 13 Aug 2018 16:46:08 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/SeeLM17},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{hermann2015teaching,
  title={Teaching machines to read and comprehend},
  author={Hermann, Karl Moritz and Kocisky, Tomas and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil},
  booktitle={Advances in neural information processing systems},
  pages={1693--1701},
  year={2015}
}

Changelog

An extensive changelog is available here.

License

MIT License

Copyright (c) 2023 Paul Landes
