
Post-truth era news article metadata service.


Metadoc


Metadoc is a post-truth era news article metadata retrieval service. Given an arbitrary article URL, it performs social media activity lookup, source authenticity rating, checksum creation, JSON-LD and meta tag parsing, and information extraction for named entities, pullquotes, fulltext and other useful things. Metadoc is also built to be relatively fast.

Example

Just throw it any news article URL, and Metadoc will yield the metadata as a dictionary.

from metadoc import Metadoc
url = "https://theintercept.com/2016/11/17/iphones-secretly-send-call-history-to-apple-security-firm-says"
metadoc = Metadoc(url=url)
res = metadoc.query()

=>

{
  '__version__': '0.9.0',
  'authors': ['Kim Zetter'],
  'canonical_url': 'https://theintercept.com/2016/11/17/iphones-secretly-send-call-history-to-apple-security-firm-says/',
  'domain': {
    'credibility': {
      'fake_confidence': '0.00',
      'is_blacklisted': False
    },
    'date_registered': None,
    'favicon': 'https://logo.clearbit.com/theintercept.com?size=200',
    'name': 'theintercept.com'
  },
  'entities': {
    'keywords': [
      'cellebrite',
      'fbi',
      'skype',
      'intercept',
      ...
    ]
  },
  'image': 'https://theintercept.imgix.net/wp-uploads/sites/1/2016/11/GettyImages-578052668-s.jpg?auto=compress%2Cformat&q=90&fit=crop&w=1200&h=800',
  'language': 'en',
  'modified_date': None,
  'published_date': '2016-11-17T11:00:36+00:00',
  'scraped_date': '2018-07-10T12:13:46+00:00',
  'social': [{
    'metrics': [{
      'count': 7340, 'label': 'sharecount'
    }],
    'provider': 'facebook'
  }],
  'text': {
    'contenthash': '940a62c70db255b4aec378529ae7a2c8',
    'fulltext': 'a guardian of user privacy this year after fighting FBI
      demands to help crack into San Bernardino shooter Syed ...',
    'reading_time': 439,
    'summary': 'Your call logs get sent to Apple’s servers whenever iCloud is on — something Apple does not disclose.'
  },
  'title': 'iPhones Secretly Send Call\xa0History to Apple, Security Firm Says',
  'url': 'https://theintercept.com/2016/11/17/iphones-secretly-send-call-history-to-apple-security-firm-says'
}
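
Since the result is a plain dictionary, downstream code can pick out fields directly. The following is a minimal sketch, assuming the key layout shown in the example output; `summarize_result` is a hypothetical helper, not part of Metadoc, and `sample` is trimmed by hand from that output.

```python
def summarize_result(res):
    """Collect a few headline fields from a Metadoc-style result dict."""
    credibility = res.get("domain", {}).get("credibility", {})

    # Map each social provider to its share count, if one is reported.
    shares = {}
    for entry in res.get("social", []):
        for metric in entry.get("metrics", []):
            if metric.get("label") == "sharecount":
                shares[entry.get("provider")] = metric.get("count", 0)

    return {
        # Titles may contain non-breaking spaces (\xa0); normalize them.
        "title": res.get("title", "").replace("\xa0", " "),
        "authors": res.get("authors", []),
        "blacklisted": credibility.get("is_blacklisted", False),
        # fake_confidence arrives as a string such as '0.00'.
        "fake_confidence": float(credibility.get("fake_confidence", "0")),
        "shares": shares,
    }


# Hand-trimmed copy of the example output, used here instead of a live query.
sample = {
    "authors": ["Kim Zetter"],
    "domain": {
        "credibility": {"fake_confidence": "0.00", "is_blacklisted": False},
        "name": "theintercept.com",
    },
    "social": [
        {"metrics": [{"count": 7340, "label": "sharecount"}], "provider": "facebook"}
    ],
    "title": "iPhones Secretly Send Call\xa0History to Apple, Security Firm Says",
}

summary = summarize_result(sample)
print(summary["title"])
print(summary["shares"])
```

Using `.get` with defaults keeps the helper tolerant of fields that come back as missing or empty for a given article.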

Trustworthiness Check

Metadoc does a basic background check on article sources. This means a simple blacklist lookup via whois data on the domain. Because the blacklists taken into account include the controversial PropOrNot, a single listing is treated as weak evidence: only a domain found on every blacklist gets a fake_confidence of 1. The resulting metadata should be taken with a grain of salt.
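
One way to read the scoring rule is as the share of blacklists that list a domain. The sketch below works under that assumption and is not Metadoc's actual implementation: `rate_domain`, the blacklist names and the `.example` domains are all hypothetical, and treating any single hit as `is_blacklisted` is a guess as well.

```python
def rate_domain(domain, blacklists):
    """Return a dict shaped like the 'credibility' field of the example output."""
    name = domain.strip().lower()
    hits = sum(1 for entries in blacklists.values() if name in entries)
    # Assumption: the score is the fraction of blacklists that list the domain,
    # so it only reaches 1 when every blacklist contains it.
    score = hits / len(blacklists) if blacklists else 0.0
    return {
        "fake_confidence": "{:.2f}".format(score),
        "is_blacklisted": hits > 0,  # assumption: any single hit counts
    }


# Hypothetical blacklists for illustration only.
BLACKLISTS = {
    "list_a": {"fake-news.example", "hoax.example"},
    "list_b": {"fake-news.example"},
}

for domain in ("theintercept.com", "hoax.example", "fake-news.example"):
    print(domain, rate_domain(domain, BLACKLISTS))
```

Under this reading, a domain on one of two lists scores 0.50, which matches the intent that a listing on a single contested blacklist should not be decisive.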

Part-of-speech tagging

For speed and simplicity, we decided against nltk and instead rely on the Averaged Perceptron tagger as described by Matthew Honnibal (@explosion). The pip package ships pre-trained on a CoNLL 2000 training set, which works reasonably well for detecting proper nouns. Since training is non-deterministic, unwanted stopwords might slip through. If you want to try out other datasets, simply replace metadoc/extract/data/training_set.txt with your own and run metadoc.extract.pos.do_train.
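
To illustrate the general technique, here is a toy averaged perceptron tagger. It is a sketch of the idea only, not the code shipped in metadoc.extract.pos: the class, the feature set, the helper names and the three training sentences are all made up for this example. Real trainers shuffle the data on every pass, which is where the non-determinism comes from; the sketch seeds its shuffle so runs are repeatable.

```python
import random
from collections import defaultdict


class AveragedPerceptron:
    """Toy multiclass averaged perceptron (illustrative only)."""

    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = defaultdict(dict)   # feature -> {class: weight}
        self._totals = defaultdict(float)  # (feature, class) -> weight summed over steps
        self._stamps = defaultdict(int)    # (feature, class) -> step of last change
        self.step = 0

    def predict(self, features):
        scores = dict.fromkeys(self.classes, 0.0)
        for feat in features:
            for cls, weight in self.weights.get(feat, {}).items():
                scores[cls] += weight
        # Ties are broken by class name so prediction is deterministic.
        return max(self.classes, key=lambda cls: (scores[cls], cls))

    def update(self, truth, guess, features):
        self.step += 1
        if truth == guess:
            return
        for feat in features:
            for cls, delta in ((truth, 1.0), (guess, -1.0)):
                key = (feat, cls)
                weight = self.weights[feat].get(cls, 0.0)
                # Credit the old weight for every step it stayed unchanged.
                self._totals[key] += (self.step - self._stamps[key]) * weight
                self._stamps[key] = self.step
                self.weights[feat][cls] = weight + delta

    def average(self):
        """Replace each weight by its mean over all training steps."""
        for feat, per_class in self.weights.items():
            for cls, weight in per_class.items():
                key = (feat, cls)
                total = self._totals[key] + (self.step - self._stamps[key]) * weight
                per_class[cls] = total / self.step


def featurize(words, i):
    """Hypothetical feature set: the word, its suffix, its case, its left neighbor."""
    word = words[i]
    prev_word = words[i - 1].lower() if i > 0 else "<start>"
    return [
        "bias",
        "word=" + word.lower(),
        "suffix=" + word[-2:].lower(),
        "capitalized=" + str(word[0].isupper()),
        "prev=" + prev_word,
    ]


def train(sentences, epochs=5, seed=0):
    classes = {tag for _, tags in sentences for tag in tags}
    model = AveragedPerceptron(classes)
    rng = random.Random(seed)  # seeded here; unseeded shuffling varies per run
    data = list(sentences)
    for _ in range(epochs):
        rng.shuffle(data)
        for words, tags in data:
            for i, truth in enumerate(tags):
                feats = featurize(words, i)
                model.update(truth, model.predict(feats), feats)
    model.average()
    return model


def tag(model, words):
    return [model.predict(featurize(words, i)) for i in range(len(words))]


# Made-up training data; a real tagger needs a corpus such as CoNLL 2000.
TOY_DATA = [
    (["Apple", "stores", "logs"], ["NNP", "VBZ", "NNS"]),
    (["Skype", "sends", "data"], ["NNP", "VBZ", "NNS"]),
    (["the", "firm", "says"], ["DT", "NN", "VBZ"]),
]

model = train(TOY_DATA)
print(tag(model, ["Apple", "sends", "logs"]))
```

Averaging the weights over all update steps, rather than keeping only the final values, damps the influence of the last few examples seen, which is the main reason this variant generalizes better than a plain perceptron.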

Purpose

This library is used in the context of a news-related software undertaking called Praise. We're building the first social network dedicated to quality journalism recommendations, synthesizing what we dub "audience-evaluated content" with automated metadata. If you're intrigued and might want to work with us, feel free to drop a line to a@praise.press.

Install

Requires Python 3.5.

Using pip

pip install metadoc

Develop

Mac OS

brew install python3 libxml2 libxslt libtiff libjpeg webp little-cms2

Ubuntu

apt-get install -y python3 libxml2-dev libxslt-dev libtiff-dev libjpeg-dev webp whois

Fedora/Redhat

dnf install libxml2-devel libxslt-devel libtiff-devel libjpeg-devel libjpeg-turbo-devel libwebp whois

Then

pip3 install -r requirements-dev.txt
python serve.py
=> serving @ 6060

Test

py.test -v tests

If you happen to run into an error on OS X 10.11 concerning a lazy bound library in PIL, just remove /PIL/.dylibs/liblzma.5.dylib.

Todo

  • Page concatenation is needed in order to properly calculate wordcount and reading time.
  • Authenticity heuristic with sharecount deviance detection (requires state).
  • Perf: Worst offender is nltk's pos tagger. Roll own w/ Average Perceptron.
  • Newspaper's summarize produces pullquotes, fulltext takes a while. Move to libextract?

Contributors

Martin Borho
Paul Solbach


Metadoc is a software product of Praise Internet UG, Hamburg.
Metadoc stems from a pedigree of nice libraries like goose3, langdetect and nltk.
Metadoc leans on this perceptron implementation inspired by Matthew Honnibal.
Metadoc is a work-in-progress.
