# pytrends_longitudinal

Generate long-term longitudinal Google Trends data.

## Introduction
This is a Python library for downloading cross-sectional and time-series Google Trends data and converting them into longitudinal data.
Although Google Trends provides cross-sectional and time-series search data, longitudinal Google Trends data are not readily available, and several practical issues make it difficult for researchers to generate such data themselves. First, Google Trends provides counts normalized from zero to 100 within each request. As a result, combining different regions' time-series data does not produce the desired longitudinal data, and for the same reason, neither does combining cross-sectional data over time. Second, Google Trends restricts data formats by timeline. For instance, you cannot collect daily data for two years in a single request: Google Trends automatically returns weekly data if the requested timeline is longer than 269 days, and monthly data if the requested timeline is longer than 269 weeks, even if you want weekly data.
The pytrends_longitudinal library resolves these issues and allows researchers to generate longitudinal Google Trends data.

This library is built on top of the pytrends library, which has a few dependencies of its own. As long as Google Trends, pytrends, and all their dependencies work, pytrends_longitudinal will also work!
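To see why per-request normalization prevents naive concatenation, here is a minimal self-contained sketch. The numbers and the `normalize` helper are invented for illustration; this is not the library's code, only the stitching idea it is built around:

```python
# Why per-request normalization breaks naive concatenation, and how an
# overlapping reference point fixes it. Purely illustrative values.

true_volume = [40, 80, 100, 50, 200]  # unobserved "real" daily search volume

def normalize(window):
    """Google Trends scales each response so its own maximum becomes 100."""
    peak = max(window)
    return [v / peak * 100 for v in window]

# Two request windows that share day 2 as an overlap point.
window_a = normalize(true_volume[0:3])  # [40.0, 80.0, 100.0]
window_b = normalize(true_volume[2:5])  # [50.0, 25.0, 100.0] (a different scale!)

# Rescale window B so its value on the shared day matches window A,
# then drop the duplicated overlap point.
ratio = window_a[-1] / window_b[0]
stitched = window_a + [v * ratio for v in window_b[1:]]
print(stitched)  # proportional to true_volume again: [40.0, 80.0, 100.0, 50.0, 200.0]
```

Simply appending `window_b` to `window_a` would have placed 25.0 next to 100.0 even though the underlying volume was four times larger, which is exactly the distortion this library corrects.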
Table of contents
- Installation
- Requirements
- Initiate
pytrends_longitudinal - Methods
- WARNING
cross_sectiontime_seriesconcat_time_seriesconvert_cross_sectionall_in_one_method
- Caveats
- Credits
- Disclaimer
## Installation

```shell
pip install pytrends-longitudinal
```
## Requirements

```shell
pip install -r requirements.txt
```
## Initiate pytrends_longitudinal

```python
from pytrends_longitudinal import RequestTrends
import datetime as dt

day_data = RequestTrends(
    keyword='Insomnia',
    topic='/m/0ddwt',
    folder_name='insomnia_save',
    start_date=dt.datetime(2021, 11, 1),
    end_date=dt.datetime(2022, 10, 24),
    data_format='daily',
)
```
The constructor initializes pytrends, which in turn queries Google Trends. During initialization, two folders are created automatically:

- a parent folder whose name the user chooses, and
- a subfolder corresponding to the data_format.

So all daily data will be stored under the 'daily' folder, weekly data under the 'weekly' folder, and so on.
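As a sketch of the resulting layout (the folder names are taken from the description above; assuming `folder_name='insomnia_save'` and `data_format='daily'`):

```python
from pathlib import Path

# Hypothetical illustration of the folders the constructor creates.
root = Path('insomnia_save')                         # parent folder named by the user
(root / 'daily').mkdir(parents=True, exist_ok=True)  # one subfolder per data_format

print(sorted(p.as_posix() for p in root.rglob('*')))  # ['insomnia_save/daily']
```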
### Parameters

- `keyword` - The keyword used for collecting Google Trends data.
- `topic` - The topic code for the keyword, if a topic is to be used instead of a plain search term.
  - For example, '/m/0ddwt' will give Google Trends data for Insomnia as the topic 'Disorder'.
  - NOTE: URLs use certain codes for special characters. For example, `%20` = white space, `%2F` = `/` (forward slash), etc.
  - If the topic and the keyword are the same, the data provided will be for the Google Trends search term and not any particular topic. So `keyword='Insomnia', topic='Insomnia'` will provide Google Trends data for Insomnia as a search term.
- `folder_name` - Name of the folder to be created to save all the data.
- `start_date` - Date to start from.
- `end_date` - Date to end at.
- `data_format` - Time basis of the query. Choose exactly one from the list: `['daily', 'weekly', 'monthly']`.
## Methods

### WARNING

Please make sure to run the methods in the following sequence:

`cross_section`, `time_series`, `concat_time_series`, `convert_cross_section`

We have noticed some unusual behavior when they are not run in this sequence. `concat_time_series` depends on `time_series`, and `convert_cross_section` depends on all three. We have also noticed that if `time_series` is run before `cross_section`, the output sometimes gets influenced by the `time_series` parameters. We are troubleshooting the issue; until then, please follow the sequence to obtain the expected result.
### cross_section

```python
day_data.cross_section(geo='US', resolution='REGION')
```

This method collects cross-sectional data for the given keyword and timeline. It calls the `pytrends.interest_by_region()` method from pytrends. The data is automatically saved under `'folder_name'/'data_format'/by_region`. Each file contains the Google Trends index of every country/state in the given region for one day/week/month, and the filename indicates the date of the data period along with the day/week/month number.

For more information on the pytrends `interest_by_region()` method, see the pytrends documentation.

PS: This method takes a long time to finish running. For example, it takes around 5 hours to collect 350 days of daily data, mainly due to the Google Trends API rate limit and the wait for it to reset.
Parameters
geo- Country/Region to collect data from. If left empty, then result will be worldwide i.e. data will be collected for all country. If left empty, defaults to worldwide country level.
resolution- 'COUNTRY' returns country level data
- 'REGION' returns region level data
- 'CITY' returns city level data
- Defaults to country
### time_series

```python
day_data.time_series(reference_geo='US-AL')
```

This method collects over-time data. It calls the `pytrends.interest_over_time()` method from pytrends. For time-series Google Trends data, Google will by default provide weekly data if the span between the start and end dates is more than 269 days, and monthly data if the span is more than 269 weeks. To tackle that problem, this method collects the daily/weekly data in chunks of fewer than 269 days/weeks. The collected data is saved under `'folder_name'/'data_format'/over_time/'reference_geo'`.

For more information on the pytrends `interest_over_time()` method, see the pytrends documentation.
#### Parameters

- `reference_geo` - Country/state/city used as the reference point for rescaling the data in a later step.
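The chunking described above can be sketched as follows. This is a simplified, non-overlapping version written for illustration; the 269-day limit comes from the Introduction, and the library's actual splitting logic may differ:

```python
import datetime as dt

MAX_DAILY_SPAN = 269  # beyond this many days, Google Trends returns weekly data

def daily_chunks(start, end, span=MAX_DAILY_SPAN):
    """Split [start, end] into consecutive date ranges short enough for daily data."""
    chunks = []
    cur = start
    while cur <= end:
        stop = min(cur + dt.timedelta(days=span - 1), end)
        chunks.append((cur, stop))
        cur = stop + dt.timedelta(days=1)
    return chunks

# The example timeline from the constructor call above fits in two requests.
for first, last in daily_chunks(dt.date(2021, 11, 1), dt.date(2022, 10, 24)):
    print(first, last, (last - first).days + 1)
```

Each chunk is normalized independently by Google, which is why the next method has to re-align them before they can be used together.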
### concat_time_series

```python
day_data.concat_time_series(reference_geo='US-AL', zero_replace=0.1)
```

This method concatenates the time-series data collected by the `time_series()` method. Because the chunks collected in `time_series` are normalized independently of each other, they need to be re-aligned to obtain a correct index for the whole time period. This method concatenates the time-series data across all periods and returns the combined, rescaled time series for the reference timeline. This rescaled time series is used in the next method to rescale the cross-sectional data.

#### Parameters

- `reference_geo` - The same geocode that was used when collecting the `time_series` data. If the time-series data for that geo was not collected beforehand, or the file does not exist, it will throw an error. Default is 'US'.
- `zero_replace` - As data from different time periods are rescaled, the last/first data point of a period can sometimes be zero, in which case the calculation either throws an error or turns every single data point into zero. To avoid that, zeros are tweaked to an insignificantly small number so the calculation can proceed.
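A toy sketch of the boundary problem that `zero_replace` works around (invented numbers, not the library's implementation): the ratio used to chain two windows divides by the shared boundary value, so a zero there is fatal.

```python
# Chaining two normalized windows via their shared boundary point.
window_a = [20.0, 60.0, 0.0]    # last point of period A happens to be zero
window_b = [0.0, 50.0, 100.0]   # same boundary point, as seen by period B

zero_replace = 0.1              # tiny stand-in value to keep the math alive
a_last = window_a[-1] or zero_replace
b_first = window_b[0] or zero_replace

ratio = a_last / b_first        # would be 0 / 0 without the replacement
stitched = window_a[:-1] + [a_last] + [v * ratio for v in window_b[1:]]
print(stitched)  # [20.0, 60.0, 0.1, 50.0, 100.0]
```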
### convert_cross_section

```python
day_data.convert_cross_section(reference_geo='US-AL', zero_replace=0.1)
```

This final method rescales the cross-sectional data based on the concatenated time-series data. This finally provides an accurate Google Trends index for each region/country/city over the provided time period.
#### Parameters

- `reference_geo` - Same as the `reference_geo` from `concat_time_series()`. If any other value is used, the result will not be accurate.
- `zero_replace` - Same as the `zero_replace` from `concat_time_series()`. It is highly recommended to use the same value to avoid inconsistent results.
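The rescaling arithmetic can be sketched with pandas on invented numbers. This illustrates the idea only; the geos, dates, and values are made up, and the library's actual code may differ:

```python
import pandas as pd

reference_geo = 'US-AL'

# Two daily cross-sections, each independently normalized to max 100 by Google.
cross_sections = {
    '2021-11-01': pd.Series({'US-AL': 100.0, 'US-NY': 40.0}),
    '2021-11-02': pd.Series({'US-AL': 50.0, 'US-NY': 100.0}),
}

# Concatenated, rescaled time series for the reference geo across the same dates.
reference_ts = pd.Series({'2021-11-01': 80.0, '2021-11-02': 20.0})

# Scale each day's snapshot so the reference geo matches its time-series value.
# Afterward, values are comparable both across regions and across days.
longitudinal = pd.DataFrame({
    day: snap * (reference_ts[day] / snap[reference_geo])
    for day, snap in cross_sections.items()
}).T
print(longitudinal)
```

Without the rescaling step, US-NY would appear to jump from 40 to 100 between the two days even though, anchored to the reference series, its actual index fell far less than the raw snapshots suggest.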
### all_in_one_method

```python
day_data.all_in_one_method(geo='US', reference_geo='US-AL', zero_replace=0.1)
```

This last method combines all the methods and executes them in the correct sequence. It collects the cross-sectional and time-series data, concatenates the time-series data, and finally rescales the cross-sectional data, all in one go. All the intermediate files are kept for cross-reference.

Note that the order of the first two methods, `cross_section()` and `time_series()`, does not matter, since they are independent. However, the latter two depend on the first two: `concat_time_series()` depends on `time_series()`, and `convert_cross_section()` depends on both `concat_time_series()` and `cross_section()`.
#### Parameters

- `geo` - Same as `geo` from `cross_section()`.
- `reference_geo` - Same as `reference_geo` from `time_series()` and `concat_time_series()`.
- `zero_replace` - Same as `zero_replace` from `concat_time_series()` and `convert_cross_section()`.
## Caveats

This is not an official or supported API.

pytrends_longitudinal is built on top of pytrends, which in turn queries Google Trends to collect the data, so we have no control over the accuracy or quality of the trends data. During testing we observed that the same inputs (keyword, topic, data_format, timeline) sometimes produced slightly different outputs.

`zero_replace` is used to avoid division errors. But when `zero_replace` is a very small number and there are many zeros in the dataset, the final output will contain very large numbers. There is no specific rule or recommendation for choosing `zero_replace`; it is a matter of trial and error.

On that note, if the search term is not very popular, the resulting dataset will contain many zeros, which will hugely impact the final outcome.
## Credits

- pytrends library
## Acknowledgement

This publication was made possible by the generous support of the Qatar Foundation through Carnegie Mellon University in Qatar's Seed Research program. The statements made herein are solely the responsibility of the authors.