Wrapper for Twitter's Premium and Enterprise search APIs

Python Twitter Search API

This library serves as a Python interface to the Twitter premium and enterprise search APIs. It provides a command-line utility and a library usable from within Python. It comes with tools for assisting in dynamic generation of search rules and for parsing tweets.

Pretty docs can be seen here.

Features

  • Command-line utility is pipeable to other tools (e.g., jq)

  • Automatically handles pagination of results with specifiable limits

  • Delivers a stream of data to the user for low in-memory requirements

  • Handles Enterprise and Premium authentication methods

  • Flexible usage within a python program

  • Compatible with our group’s Tweet Parser for rapid extraction of relevant data fields from each tweet payload

  • Supports the Counts API, which can reduce API call usage and provide rapid insights if you only need volumes and not tweet payloads

Installation

The package is hosted on PyPI, so it can be installed with pip:

pip install searchtweets

Or install the development version locally:

git clone https://github.com/twitterdev/search-tweets-python
cd search-tweets-python
pip install -e .

Using the Command Line Application

We provide a utility, search_tweets.py, in the tools directory that provides rapid access to tweets. Premium customers should use --bearer-token; enterprise customers should use --user-name and --password.

The --endpoint flag will specify the full URL of your connection, e.g.:

https://api.twitter.com/1.1/tweets/search/30day/dev.json

You can find this URL in your developer console.

Note that the --results-per-call flag sets an argument of the API call (maxResults, the number of results returned per call); it is not a hard maximum on the number of results returned by this program. Use --max-results for that for now.
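To make the distinction between the two flags concrete, here is a rough sketch of how they interact. The helper estimate_api_calls is hypothetical and not part of the library, and it assumes that every call returns a full page, which real queries may not do.

```python
import math


def estimate_api_calls(max_results: int, results_per_call: int) -> int:
    """Estimate how many paginated calls are needed to reach max_results.

    Assumption: every call returns a full page of results_per_call items.
    """
    if max_results <= 0 or results_per_call <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(max_results / results_per_call)


# --max-results 1000 with --results-per-call 100: roughly ten calls
print(estimate_api_calls(1000, 100))
```

Under this assumption, raising --results-per-call lowers the number of calls needed, while --max-results caps the total number of results.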

Stream JSON results to stdout without saving

python search_tweets.py \
  --bearer-token <BEARER_TOKEN> \
  --endpoint <MY_ENDPOINT> \
  --max-results 1000 \
  --results-per-call 100 \
  --filter-rule "beyonce has:hashtags" \
  --print-stream
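Because the output goes to stdout, it can also be piped into a small Python script instead of jq. The sketch below assumes the utility prints one JSON object per line; that output shape is an assumption, not something stated here, and the helper iter_json_lines and the sample payloads are made up for illustration.

```python
import io
import json
from typing import Iterator, TextIO


def iter_json_lines(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between payloads
        yield json.loads(line)


# Stand-in for piped input; in a real pipeline you would pass sys.stdin.
sample = io.StringIO('{"id": 1, "text": "first"}\n\n{"id": 2, "text": "second"}\n')
for payload in iter_json_lines(sample):
    print(payload["id"], payload["text"])
```

Reading line by line keeps memory use low, in the same spirit as the streaming behavior of the utility itself.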

Stream JSON results to stdout and save to a file

python search_tweets.py \
  --user-name <USERNAME> \
  --password <PW> \
  --endpoint <MY_ENDPOINT> \
  --max-results 1000 \
  --results-per-call 100 \
  --filter-rule "beyonce has:hashtags" \
  --filename-prefix beyonce_geo \
  --print-stream

Save to file without output

python search_tweets.py \
  --user-name <USERNAME> \
  --password <PW> \
  --endpoint <MY_ENDPOINT> \
  --max-results 100 \
  --results-per-call 100 \
  --filter-rule "beyonce has:hashtags" \
  --filename-prefix beyonce_geo \
  --no-print-stream

It can be far easier to specify your information in a configuration file. An example can be found in tools/api_config_example.config; it looks something like this:

[credentials]
account_name = <account_name>
username = <user_name>
password = <password>
bearer_token = <token>

[api_info]
endpoint = <endpoint>

[gnip_search_rules]
from_date = 2017-06-01
to_date = 2017-09-01
results_per_call = 100
pt_rule = beyonce has:hashtags


[search_params]
max_results = 500

[output_params]
output_file_prefix = beyonce

We will soon change this behavior: the credentials section will be removed from the config file and handled differently.

When using a config file in conjunction with the command-line utility, you need to specify your config file via the --config-file parameter. Additional command-line arguments will either be added to the config file args or overwrite the config file args if both are specified and present.

Example:

python search_tweets.py \
  --config-file myapiconfig.config \
  --no-print-stream
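The override behavior described above can be pictured with a short sketch. This is not the utility's actual implementation; merge_config_and_cli is a hypothetical helper, and the flattened key names are chosen only for illustration. It reads a config in the format shown earlier with the standard configparser module and lets explicitly supplied command-line values win.

```python
import configparser

CONFIG_TEXT = """
[credentials]
bearer_token = <token>

[api_info]
endpoint = <endpoint>

[gnip_search_rules]
results_per_call = 100
pt_rule = beyonce has:hashtags

[search_params]
max_results = 500
"""


def merge_config_and_cli(config_text: str, cli_args: dict) -> dict:
    """Flatten all config sections, then overlay the CLI values that were set."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    merged = {}
    for section in parser.sections():
        merged.update(parser[section])
    # A value of None stands for "flag not given on the command line".
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


merged = merge_config_and_cli(
    CONFIG_TEXT,
    {"max_results": "1000", "print_stream": False, "filename_prefix": None},
)
print(merged["max_results"], merged["pt_rule"], merged["print_stream"])
```

Here max_results from the command line replaces the value in the file, print_stream is added because the file does not define it, and the unset filename_prefix is ignored.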

Using the Twitter Search API within Python

Working with the API within a Python program is straightforward both for Premium and Enterprise clients.

Our group’s Python tweet parser library is a requirement.

Prior to starting your program, an easy way to define your secrets will be setting an environment variable. If you are an enterprise client, your authentication will be a (username, password) pair. If you are a premium client, you’ll need to get a bearer token that will be passed with each call for authentication.

Your credentials should be put into a YAML file that looks like this:

search_tweets_api:
  endpoint: <FULL_URL_OF_ENDPOINT>
  account: <ACCOUNT_NAME>
  username: <USERNAME>
  password: <PW>
  bearer_token: <TOKEN>

Fill in the keys that are appropriate for your account type. Premium users should only have the endpoint and bearer_token; Enterprise customers should have account, username, endpoint, and password.

Our credential reader expects this file at "~/.twitter_keys.yaml", but you can pass the relevant location as needed.
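If you want to check a credentials mapping before handing it to the library, something like the following could work. This is a hypothetical pre-flight check, not how load_credentials behaves internally; the function missing_credential_keys and the REQUIRED_KEYS table are assumptions built only from the key lists given above, and the input dict stands for the parsed search_tweets_api mapping.

```python
REQUIRED_KEYS = {
    "premium": {"endpoint", "bearer_token"},
    "enterprise": {"endpoint", "account", "username", "password"},
}


def missing_credential_keys(credentials: dict, account_type: str) -> set:
    """Return the required keys that are absent or empty for the account type."""
    if account_type not in REQUIRED_KEYS:
        raise ValueError("account_type must be 'premium' or 'enterprise'")
    present = {key for key, value in credentials.items() if value}
    return REQUIRED_KEYS[account_type] - present


premium = {"endpoint": "<FULL_URL_OF_ENDPOINT>", "bearer_token": "<TOKEN>"}
print(missing_credential_keys(premium, "premium"))  # nothing missing: set()
```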

The following cell demonstrates the basic setup that will be referenced throughout your program’s session.

from searchtweets import ResultStream, gen_rule_payload, load_credentials

Enterprise Setup

If you are an enterprise customer, you’ll need to authenticate with a basic username/password method. You can specify that here:

enterprise_search_args = load_credentials("~/.twitter_keys.yaml",
                                          account_type="enterprise")

Premium Setup

Premium customers will use a bearer token for authentication. Use the following cell for setup:

premium_search_args = load_credentials("~/.twitter_keys.yaml",
                                       account_type="premium")

There is a function, gen_rule_payload, that formats search API rules into valid JSON queries. It has sensible defaults, such as pulling more tweets per call than the default 100 (note that a sandbox environment can only have a max of 100 here, so if you get errors, please check this), not including dates, and defaulting to hourly counts when using the counts API. Discussing the finer points of generating search rules is out of scope for these examples; I encourage you to see the docs to learn the nuances, but for now let’s see what a rule looks like.

rule = gen_rule_payload("beyonce", results_per_call=100) # testing with a sandbox account
print(rule)
{"query":"beyonce","maxResults":100}

This rule will match tweets that have the text beyonce in them.
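To show what such a payload amounts to, here is a minimal stand-in built with the standard json module. It is a sketch, not the library's gen_rule_payload; the function build_rule_payload is hypothetical and covers only the two fields visible in the output above.

```python
import json
from typing import Optional


def build_rule_payload(query: str, results_per_call: Optional[int] = None) -> str:
    """Serialize a query and an optional page size as a compact JSON string."""
    payload = {"query": query}
    if results_per_call is not None:
        payload["maxResults"] = results_per_call
    # Compact separators reproduce the whitespace-free form shown above.
    return json.dumps(payload, separators=(",", ":"))


print(build_rule_payload("beyonce", results_per_call=100))
# {"query":"beyonce","maxResults":100}
```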

From this point, there are two ways to interact with the API. The first is a quick method that collects smaller amounts of tweets into memory and requires less thought and knowledge; the second is direct interaction with the ResultStream object, which is introduced later.

Fast Way

We’ll use the search_args variable to configure access to the API. The object also takes a valid PowerTrack rule and has options to cut off the search when hitting limits on both the number of tweets and the number of API calls.

We’ll be using the collect_results function, which has three parameters.

  • rule: a valid PowerTrack rule, referenced earlier

  • max_results: as the API handles pagination, collection stops when this number is reached

  • result_stream_args: configuration args that we’ve already specified.

For the remaining examples, please change the args to either premium or enterprise depending on your usage.

Let’s see how it goes:

from searchtweets import collect_results
tweets = collect_results(rule,
                         max_results=100,
                         result_stream_args=enterprise_search_args) # change this if you need to

By default, tweet payloads are lazily parsed into a Tweet object. An overwhelming number of tweet attributes are made available directly, as such:

[print(tweet.all_text) for tweet in tweets[0:10]];
That deep sigh Beyoncé took once she realized she wouldn’t be able to get the earpiece out of her hair before the dance break 😂.  https://t.co/dU1K2KMT7i
4 Years ago today, "BEYONCÉ" by Beyoncé was surprise released. It received acclaim from critics,  debuted at #1 and certified 2x Platinum in the US. https://t.co/wB3C7DuX9o
me mata la gente que se cree superior por sus gustos de música escuches queen beyonce o el polaco no sos mas ni menos que nadie
I’m literally not Beyoncé https://t.co/LwIkllCx6P
#BEYONCÉ ‣ 𝐌𝐄𝐀𝐃𝐃𝐅𝐀𝐍 𝐎𝐅𝐈𝐂𝐈𝐀𝐋 - I Am... 𝐖𝐎𝐑𝐋𝐃 𝐓𝐎𝐔𝐑! https://t.co/TyyeDdXKiM
Beyoncé on how nervous she was to release her self-titled... https://t.co/fru23c6DYC
AAAA ansiosa por esse feat da Beyoncé com Jorge Ben Jor &lt;3 https://t.co/NkKJhC9JUd
I am world tour, the Beyonce experience, revamped hmt. https://t.co/pb07eMyNka
Tell me what studio versions of any artists would u like me to do? https://t.co/Z6YWsAJuhU
Billboard's best female artists over the last decade:

2017: Ariana Grande
2016: Adele
2015: Taylor Swift
2014: Katy Perry
2013: Taylor Swift
2012: Adele
2011: Adele
2010: Lady Gaga
2009: Taylor Swift
2008: Rihanna

Beyonce = 0

Taylor Swift = 3 👑
Beyoncé explaining her intent behind the BEYONCÉ visual album &amp; how she wanted to reinstate the idea of an album release as a significant, exciting event which had lost meaning in the face of hype created around singles. 👑 https://t.co/pK2MB35vYl
[print(tweet.created_at_datetime) for tweet in tweets[0:10]];
2017-12-13 21:18:17
2017-12-13 21:18:16
2017-12-13 21:18:16
2017-12-13 21:18:15
2017-12-13 21:18:15
2017-12-13 21:18:13
2017-12-13 21:18:12
2017-12-13 21:18:12
2017-12-13 21:18:11
2017-12-13 21:18:10
[print(tweet.generator.get("name")) for tweet in tweets[0:10]];
Twitter for Android
Twitter for Android
Twitter for Android
Twitter for iPhone
Meadd
Twitter for iPhone
Twitter for Android
Twitter for iPhone
Twitter for iPhone
Twitter for Android

Voila, we have some tweets. For interactive environments and other cases where you don’t care about collecting your data in a single load or don’t need to operate on the stream of tweets or counts directly, I recommend using this convenience function.

Working with the ResultStream

The ResultStream object will be powered by the search_args, and takes the rules and other configuration parameters, including a hard stop on number of pages to limit your API call usage.

rs = ResultStream(rule_payload=rule,
                  max_results=500,
                  max_pages=1,
                  **premium_search_args)

print(rs)
ResultStream:
    {
    "username":null,
    "endpoint":"https://api.twitter.com/1.1/tweets/search/30day/dev.json",
    "rule_payload":{
        "query":"beyonce",
        "maxResults":100
    },
    "tweetify":true,
    "max_results":500
}

There is a function, .stream, that seamlessly handles requests and pagination for a given query. It returns a generator, and to grab our 500 tweets that mention beyonce we can do this:

tweets = list(rs.stream())
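The idea of a generator that pages through results and stops at a limit can be sketched without touching the network. Everything below is an illustration under assumptions: stream_results and make_fake_fetcher are hypothetical names, the fake fetcher stands in for the HTTP request, and the "results"/"next" page layout is assumed rather than taken from this document.

```python
from typing import Callable, Iterator, Optional


def stream_results(fetch_page: Callable[[Optional[str]], dict],
                   max_results: int,
                   max_pages: Optional[int] = None) -> Iterator[dict]:
    """Yield results page by page until a limit is hit or pages run out."""
    yielded = 0
    pages = 0
    next_token = None
    while True:
        if max_pages is not None and pages >= max_pages:
            return
        page = fetch_page(next_token)
        pages += 1
        for item in page["results"]:
            if yielded >= max_results:
                return
            yield item
            yielded += 1
        next_token = page.get("next")
        if next_token is None or yielded >= max_results:
            return


def make_fake_fetcher(total: int, page_size: int):
    """Build a stand-in for the HTTP call that serves `total` fake items."""
    calls = []

    def fetch_page(token: Optional[str]) -> dict:
        start = int(token) if token is not None else 0
        calls.append(start)
        end = min(start + page_size, total)
        page = {"results": [{"id": i} for i in range(start, end)]}
        if end < total:
            page["next"] = str(end)  # token pointing at the next page
        return page

    fetch_page.calls = calls  # record of requests, handy for inspection
    return fetch_page


fetch = make_fake_fetcher(total=1000, page_size=100)
items = list(stream_results(fetch, max_results=250))
print(len(items), len(fetch.calls))
```

Because the generator only requests a new page when the consumer asks for more items, wrapping it in list() is optional; iterating directly keeps only one page in memory at a time.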

Tweets are lazily parsed using our Tweet Parser, so tweet data is very easily extractable.

[print(tweet.all_text) for tweet in tweets[0:10]];
Everyone: still dragging Jay for cheating

Beyoncé: https://t.co/2z1ltlMQiJ
Beyoncé changed the game w/ that digital drop 4 years ago today! 🎉

• #1 debut on Billboard
• Sold 617K in the US / over 828K WW in only 3 days
• Fastest-selling album on iTunes of all time
• Reached #1 in 118 countries
• Widespread acclaim; hailed as her magnum opus https://t.co/lDCdVs6em3
Beyoncé 🔥 #444Tour https://t.co/sCvZzjLwqx
Se presentan casos de feminismo pop basado en sugerencias de artistas famosos en turno, Emma Watson, Beyoncé.
Beyonce. Are you kidding me with this?! #Supreme #love #everything
Dear Beyoncé, https://t.co/5visfVK2LR
At this time 4 years ago today, Beyoncé released her self-titled album BEYONCÉ exclusively on the iTunes Store without any prior announcement. The album remains the ONLY album in history to reach #1 in 118 countries &amp; the fastest-selling album in the history of the iTunes Store. https://t.co/ZZb4QyQYf0
4 years ago today, Beyoncé released her self-titled visual album "BEYONCÉ" and shook up the music world forever. 🙌🏿 https://t.co/aGtUSq9R3u
Everyone: still dragging Jay for cheating

Beyoncé: https://t.co/2z1ltlMQiJ
And Beyonce hasn't had a solo #1 hit since the Bush administration soooo... https://t.co/WCd7ni8DwN

Counts API

We can also use the counts API to get counts of tweets that match our rule. Each request will return up to 30 results, and each count request can be done on a minutely, hourly, or daily basis. The underlying ResultStream object will handle converting your endpoint to the count endpoint, and you have to specify the count_bucket argument when making a rule to use it.

The process is very similar to grabbing tweets, but has some minor differences.

Caveat - premium sandbox environments do NOT have access to the counts API.

count_rule = gen_rule_payload("beyonce", count_bucket="day")

counts = collect_results(count_rule, result_stream_args=enterprise_search_args)

Our results are pretty straightforward and can be rapidly used.

counts
[{'count': 85660, 'timePeriod': '201712130000'},
 {'count': 95231, 'timePeriod': '201712120000'},
 {'count': 114540, 'timePeriod': '201712110000'},
 {'count': 165964, 'timePeriod': '201712100000'},
 {'count': 102022, 'timePeriod': '201712090000'},
 {'count': 87630, 'timePeriod': '201712080000'},
 {'count': 195794, 'timePeriod': '201712070000'},
 {'count': 209629, 'timePeriod': '201712060000'},
 {'count': 88742, 'timePeriod': '201712050000'},
 {'count': 96795, 'timePeriod': '201712040000'},
 {'count': 177595, 'timePeriod': '201712030000'},
 {'count': 120102, 'timePeriod': '201712020000'},
 {'count': 186759, 'timePeriod': '201712010000'},
 {'count': 151212, 'timePeriod': '201711300000'},
 {'count': 79311, 'timePeriod': '201711290000'},
 {'count': 107175, 'timePeriod': '201711280000'},
 {'count': 58192, 'timePeriod': '201711270000'},
 {'count': 48327, 'timePeriod': '201711260000'},
 {'count': 59639, 'timePeriod': '201711250000'},
 {'count': 85201, 'timePeriod': '201711240000'},
 {'count': 91544, 'timePeriod': '201711230000'},
 {'count': 64129, 'timePeriod': '201711220000'},
 {'count': 92065, 'timePeriod': '201711210000'},
 {'count': 101617, 'timePeriod': '201711200000'},
 {'count': 84733, 'timePeriod': '201711190000'},
 {'count': 74887, 'timePeriod': '201711180000'},
 {'count': 76091, 'timePeriod': '201711170000'},
 {'count': 81849, 'timePeriod': '201711160000'},
 {'count': 58423, 'timePeriod': '201711150000'},
 {'count': 78004, 'timePeriod': '201711140000'},
 {'count': 118077, 'timePeriod': '201711130000'}]
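Since each bucket is a plain dict, ordinary Python is enough to work with the results. The sketch below copies the first three buckets from the output above and assumes the timePeriod strings follow a YYYYMMDDHHMM layout, which is how they appear here; parse_counts is a hypothetical helper, not part of the library.

```python
from datetime import datetime

# First three daily buckets copied from the output above.
counts = [
    {"count": 85660, "timePeriod": "201712130000"},
    {"count": 95231, "timePeriod": "201712120000"},
    {"count": 114540, "timePeriod": "201712110000"},
]


def parse_counts(buckets):
    """Return (datetime, count) pairs sorted from oldest to newest."""
    parsed = [
        (datetime.strptime(bucket["timePeriod"], "%Y%m%d%H%M"), bucket["count"])
        for bucket in buckets
    ]
    return sorted(parsed)


series = parse_counts(counts)
total = sum(count for _, count in series)
print(series[0][0].date(), total)
```

Sorting oldest first makes the series convenient to plot or to load into a time-series tool.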
