TrollHunter

TrollHunter is a Twitter crawler and news website indexer. It aims to find troll farmers and fake news on Twitter.

It is composed of three parts:

A Twint API to extract information about a tweet or a user
A News Indexer which indexes all the articles of a website and extracts their keywords
An analysis of the tweets and news

Installation

You can either run

pip3 install TrollHunter

or clone the project and run

pip3 install -r requirements.txt

Docker

TrollHunter requires several services to run:

ELK (Elasticsearch, Logstash, Kibana)
InfluxDB & Grafana
RabbitMQ

You can either launch them individually if you already have them set up, or use our docker-compose.yml:

Install Docker
Run docker-compose up -d

Setup

Change the .env file with the required values
Export the .env variables

export $(cat .env | sed 's/#.*//g' | xargs)
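
If the variables need to be loaded from Python rather than from the shell, the same idea can be sketched as follows. This is a minimal illustration, not part of TrollHunter: the function names and the sample variable names are assumptions, and quoting rules of real .env files are not handled.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, dropping '#' comments like the sed command does."""
    variables = {}
    for line in text.splitlines():
        # Remove everything from the first '#' onward, then trim whitespace.
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        variables[key.strip()] = value.strip()
    return variables


def load_env(path=".env"):
    """Read a .env file and copy its variables into os.environ."""
    with open(path, encoding="utf-8") as handle:
        variables = parse_env(handle.read())
    os.environ.update(variables)
    return variables


# Hypothetical variable names, for illustration only.
sample = "ES_HOST=localhost  # elastic host\n# a full-line comment\nES_PORT=9200\n"
print(parse_env(sample))
```

Like the sed expression, this treats any '#' as the start of a comment, so values that contain '#' would be truncated.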

Twitter crawler

Twint

To crawl tweets and extract user information, we use Twint, which lets us gather a lot of information without using the Twitter API.

Some of the benefits of using Twint vs Twitter API:

Can fetch almost all Tweets (Twitter API limits to last 3200 Tweets only);
Fast initial setup;
Can be used anonymously and without Twitter sign up;
No rate limitations.

When we used Twint, we encountered some problems:

Poor compatibility with Windows and datetime
No way to set a limit on the number of tweets retrieved
Bugs with some user agents

So we decided to fork the project.

Our fork allows us to:

get tweets
get user information
get following and followers
search tweets by hashtag or word

API

For the API, we use the open-source framework Flask.

Four endpoints are defined:

/tweets/<string:user>
- get all information about a user (tweets, follows, interactions)
/search
- crawl the tweets matching the search every 2 hours
/stop
- stop the search
/tweet/origin
- retrieve the origin of a tweet

Some query parameters are available:

tweet: set to 0 to skip tweets (default: 1)
follow: set to 0 to skip follows (default: 1)
limit: set the number of tweets to retrieve (increments of 20, default: 100)
follow_limit: set the number of following and followers to retrieve (default: 100)
since: only retrieve tweets sent since this date (example: 2017-12-27)
until: only retrieve tweets sent up until this date (example: 2017-12-27)
retweet: set to 1 to retrieve retweets (default: 0)
search:
- search terms format: "i search"
- for a hashtag: (#Hashtag)
- for multiple hashtags: (#Hashtag1 AND|OR #Hashtag2)
tweet_interact: set to 1 to parse tweet interactions between users (default: 0)
depth: search tweets and information from the list of follows
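
As an illustration of how these parameters combine, the sketch below builds request URLs for two of the endpoints. The host and port, the helper names, and the choice to pass `search` as a query parameter of /search are assumptions made for the example; check the API code for the exact contract.

```python
from urllib.parse import quote, urlencode

# Assumption: the Flask API listens on this address; adjust to your deployment.
BASE_URL = "http://localhost:5000"


def tweets_url(user, **params):
    """Build a /tweets/<user> URL with optional query parameters."""
    url = f"{BASE_URL}/tweets/{quote(user)}"
    return f"{url}?{urlencode(params)}" if params else url


def search_url(search, **params):
    """Build a /search URL; 'search' holds the terms or hashtag expression."""
    return f"{BASE_URL}/search?{urlencode({'search': search, **params})}"


# 'some_user' is a placeholder account name.
print(tweets_url("some_user", limit=40, retweet=1, since="2017-12-27"))
print(search_url("(#Hashtag1 OR #Hashtag2)", limit=100))
```

Using `urlencode` matters for the search expression, since characters such as `#`, spaces and parentheses must be percent-encoded to survive in a query string.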

Twitter Storage

Information retrieved with Twint is stored in Elasticsearch. We do not use the default Twint storage format because we want stronger relationship parsing. There are currently three indexes:

twitter_user
twitter_tweet
twitter_interaction

The first and second indexes are stored as in Twitter. The third is built to store interactions from followers/following, conversations and retweets.

News Indexer

The second main part of the project is the news crawler and indexer.

For this, we use the sitemap XML files of news websites to crawl all the articles. From a sitemap file, we extract the sitemap and url tags.

A sitemap tag is a link to a child sitemap XML file for a specific category of articles on the website.

A url tag represents an article/news item on the website.
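
To make the two tag types concrete, here is a minimal sketch that parses a sitemap document with the standard library and separates child sitemaps from articles. The sample XML, the example.com URLs and the function names are made up for illustration; this is not the project's crawler code.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap documents, for illustration only.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://news.example.com/sitemap-politics.xml</loc>
    <lastmod>2020-03-26</lastmod>
  </sitemap>
</sitemapindex>"""

SAMPLE_URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://news.example.com/politics/some-article</loc>
    <lastmod>2020-03-27</lastmod>
  </url>
</urlset>"""


def _entries(root, tag):
    """Collect loc/lastmod for every direct child element named 'tag'."""
    return [
        {
            "loc": node.findtext("sm:loc", namespaces=NS),
            "lastmod": node.findtext("sm:lastmod", namespaces=NS),
        }
        for node in root.findall(f"sm:{tag}", NS)
    ]


def parse_sitemap(xml_text):
    """Return (child_sitemaps, articles) found in one sitemap document."""
    root = ET.fromstring(xml_text)
    return _entries(root, "sitemap"), _entries(root, "url")


children, articles = parse_sitemap(SAMPLE_INDEX)
print(children, articles)
```

A crawler would call `parse_sitemap` again on each child `loc` until only url entries remain.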

The root URL of a sitemap is stored in a Postgres database along with a trust level for the website (Oriented, Verified, Fake News, ...) and headers. The headers are the tags we want to extract from the url tag, which contains details about the article (title, keywords, publication date, ...).

The headers are also the list of fields used in the Elasticsearch index pattern.

While crawling sitemaps, we insert each new child sitemap into the database with its last modification date, or update the date for the ones already in the database. The last modification date is used to crawl only the sitemaps that have changed since the last crawl.
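
The insert-or-update logic based on the last modification date can be sketched as below. A plain dictionary stands in for the Postgres table, and the function names are hypothetical; the real implementation may differ.

```python
from datetime import datetime


def needs_crawl(stored_lastmod, fetched_lastmod):
    """True when a sitemap is new or has changed since the last crawl."""
    if stored_lastmod is None:
        return True
    return fetched_lastmod > stored_lastmod


def update_known_sitemaps(known, fetched):
    """Insert new child sitemaps, update changed ones, return those to crawl.

    'known' maps sitemap URL -> last modification date and stands in for
    the database table (an in-memory assumption for this sketch).
    """
    to_crawl = []
    for loc, lastmod in fetched.items():
        if needs_crawl(known.get(loc), lastmod):
            known[loc] = lastmod
            to_crawl.append(loc)
    return to_crawl


# Made-up URLs and dates, for illustration only.
known = {"https://news.example.com/a.xml": datetime(2020, 3, 20)}
fetched = {
    "https://news.example.com/a.xml": datetime(2020, 3, 20),  # unchanged
    "https://news.example.com/b.xml": datetime(2020, 3, 26),  # new
}
print(update_known_sitemaps(known, fetched))
```

Unchanged sitemaps are skipped entirely, which is what keeps repeated crawls of large news sites cheap.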

The data extracted from the url tags is built into a dataframe, then sent to Elasticsearch for further use with the requests made through the Twint API.

Some sitemaps don't provide keywords for their articles. Hence, in parallel, we retrieve the entries without keywords from Elasticsearch. Then, we download the content of each article and extract the keywords using NLP. Finally, we update the entries in Elasticsearch.
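
The keyword back-fill step can be illustrated with a deliberately simple stand-in. The sketch below uses plain word frequency instead of a real NLP pipeline, and operates on in-memory dictionaries rather than Elasticsearch entries; the field names, stopword list and function names are assumptions for the example only.

```python
import re
from collections import Counter

# A tiny stopword list for the example; a real pipeline would use a fuller one.
STOPWORDS = {"that", "this", "with", "from", "said", "were", "have", "there"}


def extract_keywords(text, top_n=5):
    """Return the most frequent non-stopword words longer than 3 letters."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 3 and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def fill_missing_keywords(entries, top_n=5):
    """Add keywords to entries that lack them; return how many were updated."""
    updated = 0
    for entry in entries:
        if not entry.get("keywords"):
            entry["keywords"] = extract_keywords(entry["content"], top_n)
            updated += 1
    return updated


# Made-up article, for illustration only.
article = {
    "content": "The election results surprised voters. Election officials "
               "said the election was fair, and voters agreed.",
    "keywords": [],
}
fill_missing_keywords([article], top_n=2)
print(article["keywords"])
```

Frequency counting is only a placeholder here; it shows where the extraction fits in the flow, not how the project's NLP step actually ranks keywords.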

How it works

Insert a sitemap that you want to crawl with insert_sitemap(loc, lastmod, url_headers, id_trust)
Then run scheduler_news(), which will retrieve all the sitemaps that you have inserted in the database
You can also run scheduler_keywords() to extract the keywords that are missing from the URLs that have been fetched
Every URL found is inserted in Elasticsearch

Run

For the crawler/indexer:

from TrollHunter.news_crawler import scheduler_news

scheduler_news(time_interval)

For updating keywords:

from TrollHunter.news_crawler import scheduler_keywords

scheduler_keywords(time_interval, max_entry)

Or see how the main is used with Docker.

Grafana

We use Grafana to visualize and monitor different events from the crawler/indexer, such as the insertion of a URL in Elasticsearch and the extraction of keywords from an article.

To create new events:

Use TrollHunter.loggers.InfluxDBLog()
Create a new dashboard in Grafana, save it as JSON and add it to docker/grafana-provisioning/dashboards

Release history

0.3.5: Mar 27, 2020 (this version)
0.3.4: Mar 26, 2020
0.3.3: Mar 26, 2020
0.3.1: Mar 26, 2020
0.3.0: Mar 26, 2020
0.2.10: Mar 13, 2020
0.2.9: Mar 13, 2020
0.2.8: Mar 13, 2020
0.2.7: Mar 12, 2020
0.2.6: Mar 12, 2020
0.2.5: Mar 12, 2020
0.2.3: Mar 11, 2020
0.2.2: Mar 10, 2020
0.2.1: Mar 10, 2020
0.2: Mar 10, 2020
Download files

Source Distribution

TrollHunter-0.3.5.tar.gz (44.9 kB view details)

Uploaded Mar 27, 2020 Source

Built Distribution

TrollHunter-0.3.5-py3-none-any.whl (68.9 kB view details)

Uploaded Mar 27, 2020 Python 3

File details

Details for the file TrollHunter-0.3.5.tar.gz.

File metadata

Download URL: TrollHunter-0.3.5.tar.gz
Upload date: Mar 27, 2020
Size: 44.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.6

File hashes

Hashes for TrollHunter-0.3.5.tar.gz
Algorithm	Hash digest
SHA256	`5fce439c40aa4ba1cb2e59a155e38e1ad4f195b043de437f5dc83f2507747eaf`
MD5	`a6d9258d252f19ac1aaccd46464bd9ea`
BLAKE2b-256	`2ec09fe8c6da3ac3e0dc35cb704b7c4957ac468df1c92c0c06088246eb3b3f5a`

File details

Details for the file TrollHunter-0.3.5-py3-none-any.whl.

File metadata

Download URL: TrollHunter-0.3.5-py3-none-any.whl
Upload date: Mar 27, 2020
Size: 68.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.6

File hashes

Hashes for TrollHunter-0.3.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`24d77a4b5ed389b40f0dd7d808754bf216fc0f22bd0bc3d4abaaf946292b0bb5`
MD5	`629ea12e5b3e3f056976ba9a3cca048d`
BLAKE2b-256	`0b7a5255aa4591a5d52d549bbf6dacf5d3ba52a6b65287b011048c744f8a3ef5`
