

Project description


Added Python Package building



Twint API

News Indexer

The second main part of the project is the news crawler and indexer.

For this, we use the sitemap XML files of news websites to crawl all their articles. From each sitemap file, we extract the sitemap and url tags.

A sitemap tag links to a child sitemap XML file for a specific category of articles on the website.

A url tag represents a single article of the website.
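The extraction of the two tag types can be sketched as follows. This is a minimal illustration, not the TrollHunter implementation: the function name `parse_sitemap` is hypothetical, and it assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Standard namespace of the sitemap protocol (assumed to be used by the site).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (child_sitemaps, url_elements) found in one sitemap file.

    Hypothetical helper: child_sitemaps is a list of dicts with the keys
    "loc" and "lastmod"; url_elements is the list of raw <url> elements.
    """
    root = ET.fromstring(xml_text)

    child_sitemaps = []
    for node in root.findall("sm:sitemap", SITEMAP_NS):
        loc = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        lastmod = node.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        child_sitemaps.append({"loc": loc.strip(), "lastmod": lastmod})

    # Each <url> element describes one article.
    url_elements = root.findall("sm:url", SITEMAP_NS)
    return child_sitemaps, url_elements
```

A sitemap index yields only child sitemaps, and a leaf sitemap yields only url elements, so the crawler can recurse on the first list and index the second.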

The root URL of a sitemap is stored in a PostgreSQL database, together with a trust level for the website (Oriented, Verified, Fake News, ...) and a list of headers. The headers are the tags we want to extract from each url tag; they contain details about the article (title, keywords, publication date, ...).

The headers are also the list of fields used in the Elasticsearch index pattern.
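As an illustration of how the headers could drive the extraction, here is a minimal sketch. The helper names `local_name` and `extract_headers` are hypothetical, and matching tags by their local name (ignoring XML namespaces) is an assumption, not necessarily what TrollHunter does.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_headers(url_element, headers):
    """Build one record from a <url> element, keeping only the wanted headers.

    Hypothetical helper: headers is the list of tag names stored for the
    website; a header that is absent from the element is left as None.
    """
    record = {header: None for header in headers}
    for node in url_element.iter():
        name = local_name(node.tag)
        text = (node.text or "").strip()
        # Keep the first non-empty value found for each requested header.
        if name in record and record[name] is None and text:
            record[name] = text
    return record
```

Because every record has the same keys as the headers list, the records line up directly with the fields of the index.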

While crawling sitemaps, we insert each new child sitemap into the database with its last modification date, or update that date for sitemaps already in the database. The last modification date is used to crawl only the sitemaps that have changed since the last crawl.
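The incremental logic can be sketched as below. This is only an illustration under stated assumptions: a plain dict stands in for the database table, the function names are hypothetical, and lastmod values are assumed to be ISO 8601 dates or datetimes (values without a timezone are treated as UTC).

```python
from datetime import datetime, timezone


def parse_lastmod(value):
    """Parse a sitemap lastmod string into a timezone-aware datetime."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: dates without a timezone are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def select_sitemaps_to_crawl(known, found):
    """Return the sitemap URLs that are new or changed since the last crawl.

    known: dict mapping sitemap URL -> stored lastmod string (stand-in for
           the database table); it is updated in place.
    found: list of dicts with the keys "loc" and "lastmod".
    """
    to_crawl = []
    for entry in found:
        loc, lastmod = entry["loc"], entry.get("lastmod")
        stored = known.get(loc)
        # Crawl when the sitemap is new, when no date is available to
        # compare, or when the published date is newer than the stored one.
        if stored is None or lastmod is None or parse_lastmod(lastmod) > parse_lastmod(stored):
            to_crawl.append(loc)
        if lastmod is not None:
            known[loc] = lastmod
    return to_crawl
```

Sitemaps that publish no lastmod are always crawled in this sketch, since there is nothing to compare against.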

The data extracted from the url tags is assembled into a dataframe and then sent to Elasticsearch, for later use by requests made through the Twint API.
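One possible shape for the indexing step is sketched below. It is an assumption-laden illustration: a list of dicts stands in for the dataframe rows, the function name `to_bulk_actions` and the index name are hypothetical, and using the article URL as the document id is a design choice made here, not something the project states. The generated actions follow the format accepted by `elasticsearch.helpers.bulk`, which could consume them given a connected client.

```python
def to_bulk_actions(records, index_name, id_field="loc"):
    """Yield one Elasticsearch bulk action per article record.

    Hypothetical helper: records is an iterable of dicts (one per article),
    id_field is the key holding the article URL. Records without a URL are
    skipped because they cannot be identified.
    """
    for record in records:
        url = record.get(id_field)
        if not url:
            continue
        yield {
            "_index": index_name,
            # Using the URL as id means re-indexing an article overwrites
            # the existing entry instead of creating a duplicate.
            "_id": url,
            "_source": record,
        }
```

Keeping this step as a pure generator makes it easy to inspect the actions before anything is sent to the cluster.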

At the same time, some sitemaps don't provide keywords for their articles. We therefore retrieve from Elasticsearch the entries that have no keywords, download the content of each article, and extract keywords using NLP. Finally, we update the entries in Elasticsearch.
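The text does not say which NLP technique is used, so the sketch below swaps in a deliberately simple one: counting word frequencies after removing a small stopword list. The function name, the stopword list, and the `limit` parameter are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {
    "the", "and", "are", "was", "were", "for", "that", "this", "with",
    "said", "has", "have", "had", "but", "not", "from", "its",
}


def extract_keywords(text, limit=5):
    """Return up to `limit` of the most frequent non-stopword terms in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        word for word in words if len(word) > 2 and word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(limit)]
```

The returned list could then be written to the keywords field of the corresponding entry.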


For the crawler/indexer:

from TrollHunter.news_crawler import scheduler_news


For updating keywords:

from TrollHunter.news_crawler import scheduler_keywords


Alternatively, see the main usage with Docker.


  • [ ] Make a better doc
  • [ ] Start the doc


Files for TrollHunter, version 0.3.1
TrollHunter-0.3.1.tar.gz (39.1 kB), file type: Source, Python version: None
TrollHunter-0.3.1-py3-none-any.whl (64.2 kB), file type: Wheel, Python version: py3
