Twitter data collection library

Project description

Python Twitter data collector built on Tweepy that allows users to dynamically collect accounts and statuses from Twitter during streaming and automatically generates datasets from the collected data, which you can save as CSV.

This library provides a framework that you can use to build your own data collector by specifying which features have to be extracted from Twitter accounts/statuses.

Creating your Twitter dataset:

1. Instantiate an AccountCollector and/or a StatusCollector, according to what you want to collect: accounts, statuses, or both. At this step you can redefine the features that have to be extracted from the Twitter data by passing dict-like parameters in the form <feature_name, function>, where the function is applied to the user or status object. Please refer to the documentation for more details about Twitter objects.
2. Instantiate the OnlineStreamer, passing the collector as a parameter.
3. Start streaming on some topics.
4. Save the created dataset at the specified location.
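The <feature_name, function> mapping from step 1 can be pictured with a small standalone sketch. It does not call ptdc: the `extract_features` helper and the stand-in user object are hypothetical and only illustrate how such a mapping could be applied, while `screen_name`, `followers_count`, `friends_count` and `description` are attribute names of the Twitter user object.

```python
from types import SimpleNamespace

# Feature mapping in the <feature_name, function> form described above.
account_features = {
    "screen_name": lambda user: user.screen_name,
    "followers_count": lambda user: user.followers_count,
    # max(..., 1) avoids a division by zero for accounts that follow nobody.
    "followers_friends_ratio": lambda user: user.followers_count / max(user.friends_count, 1),
    "has_description": lambda user: bool(user.description),
}


def extract_features(obj, features):
    """Apply every feature function to one user or status object (hypothetical helper)."""
    return {name: fn(obj) for name, fn in features.items()}


# Stand-in for a Twitter user object, with made-up values.
user = SimpleNamespace(
    screen_name="example", followers_count=120, friends_count=60, description=""
)
row = extract_features(user, account_features)
print(row)
```

Each key becomes one column of the dataset and each function computes that column's value for one object. How the mapping is handed to the collector constructor is described in the ptdc documentation.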

NB: You do not have to use collectors and streamer together. Collectors can also be used alone, for instance if you already have the users and/or statuses to collect and you don't need to stream anything.

NEW FEATURES:

* Offline collection by name: lets the user query by name and collect users with similar names, extracting the features defined in the collector constructor.

INSTALLATION

The package is available on PyPI.

$ pip install ptdc

EXAMPLE USAGE

Import modules

from ptdc import authenticate, AccountCollector, OnlineStreamer, StatusCollector

Define tokens

Replace the following tokens with your own; see the Twitter developers authentication documentation for details on how to obtain them.

consumer_key = "xxxxxxxxxxx"
consumer_key_secret = "xxxxxxxxxxxxx"
access_token = "xxxxxxxxxxxxxxxxxxxxxx"
access_token_secret = "xxxxxxxxxxxxxxxxxx"
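Hard-coded tokens are easy to leak through version control. One alternative, sketched here, reads them from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the `load_tokens` helper are assumptions of this example and not part of ptdc.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
TOKEN_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_key_secret": "TWITTER_CONSUMER_KEY_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_tokens(environ=os.environ):
    """Return the four tokens as a dict, failing early if any is missing or empty."""
    missing = [var for var in TOKEN_VARS.values() if not environ.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in TOKEN_VARS.items()}
```

The keys of the returned dict match the variable names used in this example, so the values can be assigned to `consumer_key`, `consumer_key_secret`, `access_token` and `access_token_secret` directly.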

Create the default Tweepy API object

api = authenticate(consumer_key=consumer_key, consumer_key_secret=consumer_key_secret, access_token=access_token, access_token_secret=access_token_secret)

Create your own Collectors for collecting data

Create your own StatusCollector object

s_collector = StatusCollector(api=api)

Create your own AccountCollector object, which will also collect statuses

collector = AccountCollector(api=api, statuses_collector=s_collector)

Create the Streamer

Create the OnlineStreamer that will collect the data (in this case it will collect only 5 accounts)

streamer = OnlineStreamer(api=api, collector=collector, data_limit=5, n_statuses=400)
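To make the effect of a limit such as `data_limit=5` concrete, here is a toy sketch of limit-bounded collection. It is not ptdc's implementation: the `LimitedCollector` class, its `on_account` method and the simulated stream are all hypothetical, and only show the general idea of stopping once enough distinct accounts have been stored.

```python
class LimitedCollector:
    """Toy stand-in for a collector that stops after `data_limit` accounts."""

    def __init__(self, data_limit):
        self.data_limit = data_limit
        self.accounts = {}

    def on_account(self, account_id, features):
        """Store one account; return False once the limit has been reached."""
        self.accounts.setdefault(account_id, features)  # repeated accounts are ignored
        return len(self.accounts) < self.data_limit


collector = LimitedCollector(data_limit=5)
for i in range(100):  # simulated stream of incoming accounts
    account_id = i % 7
    if not collector.on_account(account_id, {"id": account_id}):
        break  # limit reached, stop consuming the stream

print(len(collector.accounts))
```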

Start streaming

You can start streaming in any of the ways defined by Tweepy; see its documentation for more details

streamer.stream(track=['famous', 'web', 'vip', 'holiday', 'pic', 'photo'], is_async=False)

Save dataset(s)

After streaming has ended (according to your defined limits), save the generated DataFrame(s) to CSV file(s). You just need to access the collector object and call the save_dataset method, providing the path.

streamer.collector.save_dataset(path="../data/accounts.csv")
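Once the file has been written, a quick look at its first rows helps confirm that the expected features were extracted. The sketch below uses only the standard library and assumes the saved file is a regular comma-separated file with a header row; the `preview_csv` helper is hypothetical and not part of ptdc.

```python
import csv


def preview_csv(path, n_rows=5):
    """Return the header and the first `n_rows` data rows of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no content
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

For the path used above, `header, rows = preview_csv("../data/accounts.csv")` would return the column names and up to five collected accounts.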

Questions and Contributing

Feel free to post questions and problems on the issue tracker. Pull requests are welcome!

Feel free to fork the library and to modify it or add new features and functionality.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ptdc-1.3.6.tar.gz (12.6 kB)

Built Distribution

ptdc-1.3.6-py3-none-any.whl (12.5 kB)
