Generate long-term longitudinal Google Trends Data
pytrends_longitudinal
Introduction
This is a Python library for downloading cross-sectional and time-series Google Trends data and converting them to longitudinal data.
Although Google Trends provides cross-sectional and time-series search data, longitudinal Google Trends data are not readily available, and several practical issues make it difficult for researchers to generate such data themselves.
First, Google Trends provides normalized indices from 0 to 100 rather than raw counts. As a result, combining the time-series data of different regions does not create the desired longitudinal data. For the same reason, combining cross-sectional Google Trends data over time does not create the desired longitudinal data either.
Second, Google Trends restricts the data frequency according to the requested timeline. For instance, you cannot collect daily data for 2 years in a single request: Google Trends automatically provides weekly data if the requested timeline is more than 269 days. Similarly, it automatically provides monthly data if the requested timeline is more than 269 weeks, even if you want weekly data.
The pytrends_longitudinal library resolves the aforementioned issues and allows researchers to generate longitudinal Google Trends.
This library is built on top of another library, pytrends, which itself has a few dependencies. As long as the Google Trends API, pytrends, and all their dependencies work, pytrends_longitudinal will also work!
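The normalization issue described in the introduction can be illustrated with a small sketch. The normalize_to_100 helper and the raw volumes below are invented for illustration, and the scaling is only a simplified stand-in for the normalization Google Trends applies internally. Two regions with very different search volumes end up with identical indices, so the indices cannot simply be stacked into one panel.

```python
def normalize_to_100(counts):
    """Scale a series so that its maximum becomes 100 (simplified 0-100 index)."""
    peak = max(counts)
    return [round(100 * c / peak) for c in counts]


# Hypothetical raw search volumes, invented for illustration only.
raw_large_region = [1000, 2000, 4000]
raw_small_region = [10, 20, 40]

index_large = normalize_to_100(raw_large_region)
index_small = normalize_to_100(raw_small_region)

# Both regions get the same index although their volumes differ by a factor of 100.
print(index_large)  # [25, 50, 100]
print(index_small)  # [25, 50, 100]
```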
Table of contents
- Installation
- Requirements
- Initiate pytrends_longitudinal
- Methods
  - WARNING
  - cross_section
  - time_series
  - concat_time_series
  - convert_cross_section
  - all_in_one_method
- Caveats
- Credits
- Disclaimer
Installation
pip install pytrends-longitudinal
Requirements
pip install -r requirements.txt
Initiate pytrends_longitudinal
from pytrends_longitudinal import RequestTrends
import datetime as dt
day_data = RequestTrends(keyword='Insomnia', topic='/m/0ddwt', folder_name='insomnia_save', start_date=dt.datetime(2021, 11, 1), end_date=dt.datetime(2022,10,24), data_format='daily')
The initiator call initiates pytrends, which in turn initiates the Google Trends API. In the initiation stage, two folders are created automatically:
- A parent folder, whose name the user chooses through folder_name, and
- A subfolder corresponding to the data_format.
So daily data are stored under the 'daily' folder, weekly data under the 'weekly' folder, and so on.
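As a rough sketch of the resulting layout, the snippet below builds the folder paths mentioned in this README (the by_region and over_time subfolders are described in the cross_section and time_series sections). The expected_folders helper is hypothetical and not part of the library; it only assembles paths and creates nothing on disk.

```python
from pathlib import Path


def expected_folders(folder_name, data_format, reference_geo):
    """Return the folders this README says the library writes to (hypothetical helper)."""
    base = Path(folder_name) / data_format
    return {
        "cross_section": base / "by_region",
        "time_series": base / "over_time" / reference_geo,
    }


folders = expected_folders("insomnia_save", "daily", "US-AL")
for name, path in folders.items():
    print(name, "->", path.as_posix())
```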
Parameters
keyword
- The keyword to be used for collecting Google Trends data.
topic
- The topic of the keyword, if a topic is to be used instead of a search term.
  - For example, '/m/0ddwt' will give Google Trends data for Insomnia as a topic of type 'Disorder'.
    - NOTE: URLs use certain codes for special characters. For example, %20 = white space, %2F = / (forward slash), etc.
  - If the topic and keyword are the same, the data provided will be for the Google Trends search term and not for any particular topic. So keyword='Insomnia', topic='Insomnia' will provide Google Trends data for Insomnia as a search term.
folder_name
- Name of folder to be created to save all the data
start_date
- Date to start from
end_date
- Date to end at
data_format
- Time basis of the query
- Choose exactly one from the list: ['daily', 'weekly', 'monthly']
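Regarding the note on URL codes under topic: if a topic code is copied from a URL, it may appear percent-encoded. A minimal sketch of decoding it with the Python standard library follows; the encoded strings are illustrative examples, not values taken from the library.

```python
from urllib.parse import unquote

# A topic code as it might appear inside a URL, with %2F standing for "/".
encoded_topic = "%2Fm%2F0ddwt"
topic = unquote(encoded_topic)
print(topic)  # /m/0ddwt

# %20 stands for a white space.
print(unquote("sleep%20disorder"))  # sleep disorder
```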
Methods
WARNING
Please make sure to run the methods in the following sequence:
cross_section
time_series
concat_time_series
convert_cross_section
We have noticed some unusual behavior when the methods are not run in the given sequence. First, concat_time_series depends on time_series, and convert_cross_section depends on all three. Second, we have noticed that if time_series is run before cross_section, the output is sometimes influenced by the time_series parameters. We are troubleshooting the issue. Until then, please follow the sequence to obtain the expected result.
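The recommended sequence can be wrapped in a small helper, sketched below. The run_in_recommended_order function and the CallRecorder stand-in are hypothetical and only illustrate the call order; the method names and keyword arguments are the ones used in the examples in this README. With a real RequestTrends object, each call contacts Google Trends and can take a long time.

```python
def run_in_recommended_order(trends, geo, reference_geo, zero_replace=0.1):
    """Call the four methods in the order recommended in the WARNING section."""
    trends.cross_section(geo=geo, resolution="REGION")
    trends.time_series(reference_geo=reference_geo)
    trends.concat_time_series(reference_geo=reference_geo, zero_replace=zero_replace)
    trends.convert_cross_section(reference_geo=reference_geo, zero_replace=zero_replace)


class CallRecorder:
    """Stand-in object that records method names instead of downloading data."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the method names.
        def record(**kwargs):
            self.calls.append(name)
        return record


recorder = CallRecorder()
run_in_recommended_order(recorder, geo="US", reference_geo="US-AL")
print(recorder.calls)

# With the real library, one would pass the RequestTrends object instead:
# run_in_recommended_order(day_data, geo="US", reference_geo="US-AL")
```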
cross_section
day_data.cross_section(geo='US', resolution="REGION")
This method collects cross-sectional data for the given keyword and timeline. It calls the interest_by_region() method from pytrends. The data are saved automatically in → 'folder_name'/'data_format'/by_region. Each file contains the Google Trends index of every country/state in the given region for one day/week/month. The filename states the date of the period covered and also indicates the number of the day/week/month.
For more information on the pytrends interest_by_region() method, see the pytrends documentation.
PS: This method takes a long time to finish running. For example, it takes around 5 hours to collect 350 days of daily data. Most of that time is spent waiting for the Google Trends API rate limit to reset.
Parameters
geo
- Country/region to collect data from. If left empty, data are collected worldwide at the country level, i.e. for all countries.
resolution
- 'COUNTRY' returns country level data
- 'REGION' returns region level data
- 'CITY' returns city level data
- Defaults to 'COUNTRY'
time_series
day_data.time_series(reference_geo='US-AL')
This method collects data over time. It calls the interest_over_time() method from pytrends. For time-series data, Google Trends by default provides weekly data if the span between the start and end dates is more than 269 days, and monthly data if the span is more than 269 weeks. To work around this, the method collects the daily/weekly data in chunks of less than 270 days/weeks. The collected data are saved under → 'folder_name'/'data_format'/over_time/'reference_geo'.
For more information on the pytrends interest_over_time() method, see the pytrends documentation.
Parameters
reference_geo
- Country/state/city to be used as the reference point for rescaling the data in the later steps
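The chunking idea described for time_series can be sketched as follows. This is not the library's implementation: the split_into_chunks helper is hypothetical, and the choice to let adjacent chunks share their boundary date (so that they can be re-aligned later) is an assumption made for this illustration.

```python
import datetime as dt


def split_into_chunks(start, end, max_days=269):
    """Split [start, end] into windows covering at most max_days days each.

    Adjacent windows share their boundary date (an assumption of this sketch).
    """
    if max_days < 2:
        raise ValueError("max_days must be at least 2")
    chunks = []
    chunk_start = start
    while True:
        chunk_end = min(chunk_start + dt.timedelta(days=max_days - 1), end)
        chunks.append((chunk_start, chunk_end))
        if chunk_end >= end:
            return chunks
        chunk_start = chunk_end  # the next window starts on the shared date


chunks = split_into_chunks(dt.date(2021, 1, 1), dt.date(2022, 12, 31))
for chunk_start, chunk_end in chunks:
    print(chunk_start, "to", chunk_end)
```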
concat_time_series
day_data.concat_time_series(reference_geo='US-AL', zero_replace=0.1)
This method concatenates the time-series data collected by the time_series() method. Because the chunks collected by time_series are normalized independently of each other, they need to be re-aligned to obtain a correct index for the whole time period. This method concatenates the time_series data for all the periods and returns the combined, rescaled time_series data for the reference timeline. The rescaled time_series data are used in the next method to rescale the cross_section data.
Parameters
reference_geo
- The same geo code that was used when collecting the time_series data. If the time_series data for that geo have not been collected beforehand, or the file does not exist, the method will throw an error. Default is 'US'.
zero_replace
- As data from different time periods are rescaled, the last/first data point of a period may sometimes be zero. The calculation would then either throw an error or turn every single data point into zero. To avoid that, zeroes are replaced with a small, insignificant number so that the calculation can proceed.
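One way to picture the re-alignment and the role of zero_replace is the chain-linking sketch below. It assumes that the last point of one chunk and the first point of the next refer to the same date; the concat_chunks helper and the numbers are hypothetical, and the library's actual formula may differ.

```python
def concat_chunks(chunks, zero_replace=0.1):
    """Chain-link independently normalized chunks into one series.

    Assumes the last value of a chunk and the first value of the next chunk
    describe the same date, so their ratio gives the rescaling factor.
    """
    combined = [v if v != 0 else zero_replace for v in chunks[0]]
    for chunk in chunks[1:]:
        cleaned = [v if v != 0 else zero_replace for v in chunk]
        # Without zero_replace, a zero here would raise ZeroDivisionError,
        # and a zero in the numerator would flatten the rest of the series.
        factor = combined[-1] / cleaned[0]
        combined.extend(v * factor for v in cleaned[1:])
    return combined


# Hypothetical chunks: 40 and 80 are the same date on two different scales.
series = concat_chunks([[50, 100, 40], [80, 100, 20]])
print(series)  # [50, 100, 40, 50.0, 10.0]
```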
convert_cross_section
day_data.convert_cross_section(reference_geo='US-AL', zero_replace=0.1)
This final method will rescale the cross section data based on the concatenated time series data. This will finally provide the accurate google trends index for each region/country/city over the provided time period.
Parameters
reference_geo
- Same as the reference_geo from concat_time_series(). If any other value is used, the result will not be accurate.
zero_replace
- Same as zero_replace from concat_time_series(). It is highly recommended to use the same value to avoid inconsistent results.
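The rescaling step can be pictured with the sketch below, which is an assumption about the general idea rather than the library's code. For each period, every region's cross-sectional index is divided by the reference region's index and multiplied by the reference region's rescaled time-series value for that period. The convert_period helper and all numbers are hypothetical.

```python
def convert_period(cross_section, reference_geo, reference_value, zero_replace=0.1):
    """Rescale one period's cross-section using the reference region's
    time-series value for that period (hypothetical helper)."""
    ref_index = cross_section[reference_geo]
    if ref_index == 0:
        ref_index = zero_replace  # avoid division by zero
    return {geo: value / ref_index * reference_value
            for geo, value in cross_section.items()}


# Hypothetical cross-sections for two periods, each normalized separately.
period_1 = {"US-AL": 50, "US-CA": 100}
period_2 = {"US-AL": 100, "US-CA": 50}

# Hypothetical rescaled time-series values of the reference region US-AL.
converted_1 = convert_period(period_1, "US-AL", reference_value=20)
converted_2 = convert_period(period_2, "US-AL", reference_value=40)

print(converted_1)  # {'US-AL': 20.0, 'US-CA': 40.0}
print(converted_2)  # {'US-AL': 40.0, 'US-CA': 20.0}
```

After the conversion, values are comparable both across regions and across periods, because every value is expressed on the reference region's time-series scale.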
all_in_one_method
day_data.all_in_one_method(geo='US', reference_geo='US-AL', zero_replace=0.1)
This method combines all the methods above and executes them in the correct sequence. It collects the cross_section and time_series data, concatenates the time_series data, and finally rescales the cross-sectional data, all in one go. All the intermediate files are kept for cross-reference.
Note that, in principle, the order of the first two methods, cross_section() and time_series(), does not matter since they are independent (but see the WARNING above for the order currently recommended). However, the latter two depend on the first two: concat_time_series() depends on time_series(), and convert_cross_section() depends on both concat_time_series() and cross_section().
Parameters
geo
- Same as geo from cross_section()
reference_geo
- Same as reference_geo from time_series() and concat_time_series()
zero_replace
- Same as zero_replace from concat_time_series() and convert_cross_section()
Caveats
This is not an official or supported API.
pytrends_longitudinal is built on top of pytrends, and pytrends uses the Google Trends API to collect trends data, so we have no control over the accuracy or quality of the trends data. During tests, we observed that the same inputs (keyword, topic, data_format, timeline) produced slightly different outputs.
zero_replace is used to avoid division errors. However, when zero_replace is a very small number and the dataset contains many zeroes, the final output will contain very large numbers. There is no specific rule or recommendation for choosing zero_replace; it is a matter of trial and error.
On that note, if the search term is not very popular, the resulting dataset will contain many zeroes, which will heavily affect the final outcome.
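The sensitivity to zero_replace mentioned above can be seen in a tiny sketch. The rescale helper is hypothetical and mirrors only the division step: whenever the reference value is zero, the result grows as zero_replace shrinks.

```python
import math


def rescale(value, reference, zero_replace):
    """Divide by the reference value, substituting zero_replace for a zero."""
    denominator = reference if reference != 0 else zero_replace
    return value / denominator


# The same data point with a zero reference and three zero_replace choices.
results = {zr: rescale(50, 0, zr) for zr in (1.0, 0.1, 0.001)}
for zr, result in results.items():
    print(f"zero_replace={zr}: rescaled value is about {result:.0f}")
```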
Credits
pytrends library
Acknowledgement
This publication was made possible by the generous support of the Qatar Foundation through Carnegie Mellon University in Qatar's Seed Research program. The statements made herein are solely the responsibility of the authors.