CNN/DailyMail Dataset as SQLite

Creates a SQLite database of the CNN and DailyMail summarization dataset.

Documentation

See the full documentation. The API reference is also available.

Obtaining

The easiest way to install the command line program is via the pip installer:

pip3 install zensols.cnndmdb

Binaries are also available on PyPI.

Usage

First, create the SQLite database file by running cnndmdb load, then check that the file data/cnn.sqlite3 was created. This takes a while because the entire corpus is downloaded first and then inserted into the SQLite file.
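
As a quick sanity check once the load finishes, the following sketch confirms that the file exists and lists the tables it contains, using only Python's standard sqlite3 module. The path data/cnn.sqlite3 comes from the text above. The helper name list_tables is hypothetical and not part of this package, and no particular table names are assumed.

```python
import sqlite3
from pathlib import Path
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the names of the tables in an existing SQLite file."""
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f"database file not found: {path}")
    # Open read-only so the check can never modify the loaded corpus.
    conn = sqlite3.connect(f"{path.resolve().as_uri()}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # data/cnn.sqlite3 is the path named in the usage instructions.
    try:
        print(list_tables("data/cnn.sqlite3"))
    except FileNotFoundError as err:
        print(err)
```

An empty list or a missing file suggests the load did not complete.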

Command Line

The keys of the SQLite database can be printed:

cnndmdb keys

The command line can also be used to print an article by its key:

cnndmdb show -t org 3b07f5102c69e3e609d73b2ccb0dc5549d4fbaf6

The -t org option tells the program to treat the given key as an original corpus key. The -t option also supports other ways of selecting an article: by SQLite rowid key or as the Kth smallest article.
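
To call the same command from a script, a thin wrapper around subprocess is one option. The sketch below uses only the show subcommand and the -t org option shown above; the function names show_command and show_article are hypothetical, and it assumes the article is written to standard output. Other values for -t are not spelled out in this document, so none are named here.

```python
import subprocess
from typing import List


def show_command(key: str, key_type: str = "org") -> List[str]:
    """Build the argument list for ``cnndmdb show -t <key_type> <key>``."""
    if not key:
        raise ValueError("an article key is required")
    return ["cnndmdb", "show", "-t", key_type, key]


def show_article(key: str, key_type: str = "org") -> str:
    """Run the command and return its output (needs cnndmdb on the PATH)."""
    result = subprocess.run(
        show_command(key, key_type),
        capture_output=True,  # assumption: the article goes to stdout
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting issues with the key.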

API

The corpus objects are accessible as mapped Python objects. For example:

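# Assumes these names are exported by the top-level zensols.cnndmdb package.
from zensols.cnndmdb import ApplicationFactory, Corpus, Article

corpus: Corpus = ApplicationFactory.get_corpus()
art: Article = next(iter(corpus.stash.values()))  # first article in the stash
print(art.text)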
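
The example above reads a single article. To look at the first few, a lazy iterator avoids pulling the whole corpus into memory. This is a minimal sketch that relies only on what the example shows: the stash has a values() method and each article has a text attribute. The helper name first_texts is hypothetical.

```python
from itertools import islice
from typing import Any, Iterator


def first_texts(stash: Any, n: int) -> Iterator[str]:
    """Yield the text of the first ``n`` articles without reading the rest."""
    # islice stops after n items, so later articles are never fetched.
    for article in islice(stash.values(), n):
        yield article.text


# Hypothetical usage with the corpus object from the example above:
#   for text in first_texts(corpus.stash, 3):
#       print(text[:80])
```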

Data Source

The data is sourced from a TensorFlow dataset, which in turn uses Abigail See's GitHub repository.

@article{DBLP:journals/corr/SeeLM17,
  author    = {Abigail See and
               Peter J. Liu and
               Christopher D. Manning},
  title     = {Get To The Point: Summarization with Pointer-Generator Networks},
  journal   = {CoRR},
  volume    = {abs/1704.04368},
  year      = {2017},
  url       = {http://arxiv.org/abs/1704.04368},
  archivePrefix = {arXiv},
  eprint    = {1704.04368},
  timestamp = {Mon, 13 Aug 2018 16:46:08 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/SeeLM17},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{hermann2015teaching,
  title={Teaching machines to read and comprehend},
  author={Hermann, Karl Moritz and Kocisky, Tomas and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil},
  booktitle={Advances in neural information processing systems},
  pages={1693--1701},
  year={2015}
}

Changelog

An extensive changelog is available here.

License

MIT License

Copyright (c) 2023 Paul Landes

