A Python package to scrape data from Ghana news portals
GhanaNews Scraper
A simple, unofficial Python package to scrape data from GhanaWeb, MyJoyOnline, DailyGraphic, CitiBusinessNews, YenGH, 3News, MyNewsGh, and PulseGh. Affiliated with Bank of Ghana Fx Rates and GhanaShops-Scraper.
NOTE: This library may keep changing due to changes to the respective websites.
How to install
pip install ghananews-scraper
Example Google Colab Notebook
Warning: DO NOT RUN the GhanaWeb code in an online Google Colab notebook.
Some GhanaWeb Urls:
urls = [
    "https://www.ghanaweb.com/GhanaHomePage/regional/",
    "https://www.ghanaweb.com/GhanaHomePage/editorial/",
    "https://www.ghanaweb.com/GhanaHomePage/health/",
    "https://www.ghanaweb.com/GhanaHomePage/diaspora/",
    "https://www.ghanaweb.com/GhanaHomePage/tabloid/",
    "https://www.ghanaweb.com/GhanaHomePage/africa/",
    "https://www.ghanaweb.com/GhanaHomePage/religion/",
    "https://www.ghanaweb.com/GhanaHomePage/NewsArchive/",
    "https://www.ghanaweb.com/GhanaHomePage/business/",
    "https://www.ghanaweb.com/GhanaHomePage/SportsArchive/",
    "https://www.ghanaweb.com/GhanaHomePage/entertainment/",
    "https://www.ghanaweb.com/GhanaHomePage/television/"
]
Outputs
- All outputs will be saved in a `.csv` file. Other file formats are not yet supported.
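Since the scrapers write CSV only, any other format has to be produced after download. A minimal sketch using the standard library (the file name and columns below are made-up stand-ins, since the actual output schema is not documented):

```python
import csv
import json
from pathlib import Path

def csv_to_json(csv_path, json_path):
    """Read a scraped CSV file and re-save it as a list of JSON records."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    Path(json_path).write_text(json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8")
    return len(rows)

# Stand-in file for illustration; real scraper output has its own columns.
Path("sample.csv").write_text("title,url\nSome headline,https://example.com\n", encoding="utf-8")
csv_to_json("sample.csv", "sample.json")
```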
Usage
from ghanaweb.scraper import GhanaWeb
url = 'https://www.ghanaweb.com/GhanaHomePage/politics/'
# url = "https://www.ghanaweb.com/GhanaHomePage/NewsArchive/"
# url = 'https://www.ghanaweb.com/GhanaHomePage/health/'
# url = 'https://www.ghanaweb.com/GhanaHomePage/crime/'
# url = 'https://www.ghanaweb.com/GhanaHomePage/regional/'
# url = 'https://www.ghanaweb.com/GhanaHomePage/year-in-review/'
# web = GhanaWeb(url='https://www.ghanaweb.com/GhanaHomePage/politics/')
web = GhanaWeb(url=url)
# scrape data and save to `current working dir`
web.download(output_dir=None)
Scrape list of articles from GhanaWeb
from ghanaweb.scraper import GhanaWeb
urls = [
'https://www.ghanaweb.com/GhanaHomePage/politics/',
'https://www.ghanaweb.com/GhanaHomePage/health/',
'https://www.ghanaweb.com/GhanaHomePage/crime/',
'https://www.ghanaweb.com/GhanaHomePage/regional/',
'https://www.ghanaweb.com/GhanaHomePage/year-in-review/'
]
for url in urls:
    print(f"Downloading: {url}")
    web = GhanaWeb(url=url)
    # download to current working directory
    # if no location is specified
    # web.download(output_dir="/Users/tsiameh/Desktop/")
    web.download(output_dir=None)
Scrape data from MyJoyOnline
- It is recommended to use Option 1. Option 2 may run into timeout issues.
- DO NOT RUN THIS IN a Google Colab notebook; run it in a `.py` script, Visual Studio Code, or the terminal, because it relies on the `selenium` package.
- You may pass `driver_name="chrome"` or `driver_name="firefox"`.
# Option 1.
from myjoyonline.scraper import MyJoyOnlineNews
url = 'https://myjoyonline.com/politics'
print(f"Downloading data from: {url}")
joy = MyJoyOnlineNews(url=url)
# joy = MyJoyOnlineNews(url=url, driver_name="firefox")
joy.download()
# Option 2.
from myjoyonline.scraper import MyJoyOnlineNews
urls = [
    'https://myjoyonline.com/news',
    'https://myjoyonline.com/politics',
    'https://myjoyonline.com/entertainment',
    'https://myjoyonline.com/business',
    'https://myjoyonline.com/sports',
    'https://myjoyonline.com/opinion',
    'https://myjoyonline.com/technology'
]
for url in urls:
    print(f"Downloading data from: {url}")
    joy = MyJoyOnlineNews(url=url)
    # download to current working directory
    # if no location is specified
    # joy.download(output_dir="/Users/tsiameh/Desktop/")
    joy.download()
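Because Option 2 can hit timeouts, it may help to retry failed downloads. A hypothetical sketch (the package's exact exception types are not documented, so this catches broadly; `download_with_retry` is not part of the library):

```python
import time

def download_with_retry(scraper, attempts=3, wait=5):
    """Call scraper.download(), retrying on failure with a pause between tries."""
    for attempt in range(1, attempts + 1):
        try:
            scraper.download()
            return True
        except Exception as exc:  # scraper's exception types are not documented
            print(f"Attempt {attempt} failed: {exc}")
            if attempt < attempts:
                time.sleep(wait)
    return False
```

Usage would be `download_with_retry(MyJoyOnlineNews(url=url))` inside the loop above.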
Scrape data from CitiBusinessNews
- Here is a list of publisher names:
citibusinessnews
aklama
ellen
emmanuel-oppong
nerteley
edna-agnes-boakye
nii-larte-lartey
naa-shika-caesar
ogbodu
- Note: using a publisher name fetches more data than using a URL.
from citionline.scraper import CitiBusinessOnline
urls = [
"https://citibusinessnews.com/ghanabusinessnews/features/",
"https://citibusinessnews.com/ghanabusinessnews/telecoms-technology/",
"https://citibusinessnews.com/ghanabusinessnews/international/",
"https://citibusinessnews.com/ghanabusinessnews/news/government/",
"https://citibusinessnews.com/ghanabusinessnews/news/",
"https://citibusinessnews.com/ghanabusinessnews/business/",
"https://citibusinessnews.com/ghanabusinessnews/news/economy/",
"https://citibusinessnews.com/ghanabusinessnews/news/general/",
"https://citibusinessnews.com/ghanabusinessnews/news/top-stories/",
"https://citibusinessnews.com/ghanabusinessnews/business/tourism/"
]
for url in urls:
    print(f"Downloading data from: {url}")
    citi = CitiBusinessOnline(url=url)
    citi.download()
# OR: scrape using publisher name
from citionline.authors import CitiBusiness
citi = CitiBusiness(author="citibusinessnews", limit_pages=4)
citi.download()
Scrape data from DailyGraphic
from graphiconline.scraper import GraphicOnline
urls = [
"https://www.graphic.com.gh/news.html",
"https://www.graphic.com.gh/news/politics.html",
"https://www.graphic.com.gh/lifestyle.html",
"https://www.graphic.com.gh/news/education.html",
"https://www.graphic.com.gh/native-daughter.html",
"https://www.graphic.com.gh/international.html"
]
for url in urls:
    print(f"Downloading data from: {url}")
    graphic = GraphicOnline(url=url)
    graphic.download()
Scrape data from YenGH
# OPTION: 1
from yen.scrapy import YenNews
url = 'https://www.yen.com.gh/'
print(f"Downloading data from: {url}")
yen = YenNews(url=url)
yen.download()
# OPTION: 2
from yen.scrapy import YenNews
urls = [
'https://www.yen.com.gh/',
'https://yen.com.gh/politics/',
'https://yen.com.gh/world/europe/',
'https://yen.com.gh/education/',
'https://yen.com.gh/ghana/',
'https://yen.com.gh/people/',
'https://yen.com.gh/world/asia/',
'https://yen.com.gh/world/africa/',
'https://yen.com.gh/entertainment/',
'https://yen.com.gh/business-economy/money/',
'https://yen.com.gh/business-economy/technology/'
]
for url in urls:
    print(f"Downloading data from: {url}")
    yen = YenNews(url=url)
    yen.download()
Scrape data from MyNewsGh
from mynewsgh.scraper import MyNewsGh
# scrape from multiple URLs
urls = [
"https://www.mynewsgh.com/category/politics/",
"https://www.mynewsgh.com/category/news/",
"https://www.mynewsgh.com/category/entertainment/",
"https://www.mynewsgh.com/category/business/",
"https://www.mynewsgh.com/category/lifestyle/",
"https://www.mynewsgh.com/tag/feature/",
"https://www.mynewsgh.com/category/world/",
"https://www.mynewsgh.com/category/sports/"
]
for url in urls:
    print(f"Downloading data from: {url}")
    my_news = MyNewsGh(url=url, limit_pages=50)
    my_news.download()
# scrape from a single URL
from mynewsgh.scraper import MyNewsGh
url = "https://www.mynewsgh.com/category/politics/"
my_news = MyNewsGh(url=url, limit_pages=None)
my_news.download()
Scrape data from 3News
from threenews.scraper import ThreeNews
# DO NOT RUN ALL AUTHORS: select ONLY few
# DO NOT CHANGE THE AUTHOR NAMES
authors = [
"laud-nartey",
"3xtra",
"essel-issac",
"arabaincoom",
"bbc",
"betty-kankam-boadu",
"kwameamoh",
"fiifi_forson",
"fdoku",
"frankappiah",
"godwin-asediba",
"afua-somuah",
"irene",
"joyce-sesi",
"3news_user",
"ntollo",
"pwaberi-denis",
"sonia-amade",
"effah-steven",
"michael-tetteh"
]
for author in authors:
    print(f"Downloading data from author: {author}")
    three_news = ThreeNews(author=author, limit_pages=50)
    three_news.download()
# OR
from threenews.scraper import ThreeNews
three = ThreeNews(author="laud-nartey", limit_pages=None)
three.download()
Scrape data from PulseGh
- Select only a few URLs.
- Note: these values may change.

| Category | Number of Pages |
|---|---|
| News | 40 |
| Entertainment | 40 |
| Business | 40 |
| Lifestyle | 40 |
| Business/Domestic | 26 |
| Business/International | 40 |
| Sports/Football | 99 |
| News/Politics | 40 |
| News/Local | 40 |
| News/World | 40 |
| News/Filla | 38 |
| Entertainment/Celebrities | 40 |
| Lifestyle/Fashion | 40 |
from pulsegh.scraper import PulseGh
urls = [
"https://www.pulse.com.gh/news",
"https://www.pulse.com.gh/news/politics",
"https://www.pulse.com.gh/entertainment",
"https://www.pulse.com.gh/lifestyle",
"https://www.pulse.com.gh/sports",
"https://www.pulse.com.gh/sports/football",
"https://www.pulse.com.gh/business/international",
"https://www.pulse.com.gh/business/domestic",
"https://www.pulse.com.gh/business",
"https://www.pulse.com.gh/quizzes",
"https://www.pulse.com.gh/news/filla",
"https://www.pulse.com.gh/news/world"
]
for url in urls:
    print(f"Downloading data from: {url}")
    pulse = PulseGh(url=url, limit_pages=5)
    pulse.download()
# news has 40 pages
from pulsegh.scraper import PulseGh
pulse = PulseGh(url="https://www.pulse.com.gh/news", total_pages=40, limit_pages=20)
pulse.download()
# Sports/football has 99 pages
from pulsegh.scraper import PulseGh
pulse = PulseGh(url="https://www.pulse.com.gh/sports/football", total_pages=99, limit_pages=None)
pulse.download()
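Each scraper call writes its own CSV, so scraping several sections leaves you with many files. A sketch for merging them with the standard library (this helper is not part of the package; it assumes the CSVs sit in the current directory, and column names vary by scraper):

```python
import csv
import glob
from pathlib import Path

def combine_csvs(pattern="*.csv", out="combined.csv"):
    """Concatenate rows from every CSV matching `pattern` into one file,
    tagging each row with the name of the file it came from."""
    rows, fieldnames = [], []
    for path in sorted(glob.glob(pattern)):
        if Path(path).name == out:
            continue  # don't re-read our own output
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                row["source_file"] = Path(path).name
                rows.append(row)
                for key in row:
                    if key not in fieldnames:
                        fieldnames.append(key)
    with open(out, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```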
BuyMeCoffee
Credits
Theophilus Siameh