Analyze the reliability of a source using Wikipedia.
Project description
Source Trust
This tool can be used to assess the trustworthiness of a website based on publicly available information.
It works by consulting Wikipedia to check whether a website is associated with categories that call its trustworthiness into question (e.g. Pseudoscience), and by looking the website up in publicly available, curated databases of websites associated with fake news.
This tool could be further extended to do lookups on attributes like domain ownership, affiliation with organizations known to be producers of untrustworthy content, and more.
Getting Started
First, install the project from PyPI:
pip install sourcetrust
Then, you can use the tool as follows:
from source_trust import generate_report

DOMAIN = "ABCNews.com.co"
report = generate_report(DOMAIN.lower())

# "all_categories" lists every category on the wiki page, so it is excluded from the flag check.
if any(len(value) > 0 for key, value in report.items() if key != "all_categories"):
    print("Website is flagged for the following reasons:")
    for key, value in report.items():
        if key == "all_categories":
            continue
        if len(value) > 0:
            print(key + ": " + ", ".join(value))
else:
    print("Website is not flagged")
Path of a Request
Source Trust completes several steps to determine the trustworthiness of a website.
First, Source Trust opens the cache of the day. This cache includes:
- The categories from all previous requests made that day.
- The known problematic websites from Wikipedia's lists of untrustworthy websites, retrieved that day.
If the site has not yet been retrieved that day, its wiki page is fetched and categories are extracted from it.
If the site has already been retrieved, the categories are read from the cache.
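A minimal sketch of how such a per-day cache could be structured and consulted (the file layout and key names are assumptions for illustration, not the package's actual cache format):

import json
from datetime import date
from pathlib import Path

CACHE_DIR = Path("cache")  # hypothetical location of the daily cache files

def load_daily_cache(day: date) -> dict:
    """Load the cache for a given day, or start an empty one."""
    path = CACHE_DIR / f"{day.isoformat()}.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"categories": {}, "known_problematic_websites": []}

cache = load_daily_cache(date.today())
domain = "abcnews.com.co"
if domain in cache["categories"]:
    categories = cache["categories"][domain]  # cache hit: reuse categories from an earlier request today
else:
    categories = None  # cache miss: fetch the wiki page and extract its categories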
Next, the following checks take place:
- If the site is in the known problematic websites list, it is flagged.
- If the site has any negative sentiment categories, a consensus algorithm is run. This algorithm uses cached categories from the last N days, if available, to determine if any problematic categories are consistent across multiple days. There are four consensus methods available (a sketch follows this list):
  - percent: the percentage of days the category was present in the last N days.
  - majority: the category was present in the majority of the last N days.
  - unanimous: the category was present in all of the last N days.
  - in_one_or_more: the category was present in one or more of the last N days.
- If the consensus algorithm finds a category or set of categories that meet the specified consensus method, the site is flagged.
- If the consensus algorithm does not have enough data (i.e. there are not enough days in the cache), the site is flagged if it has any negative sentiment categories.
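A minimal sketch of how such a consensus check could work, assuming one set of cached categories per day (function and parameter names here are illustrative, not the package's API):

def category_has_consensus(daily_categories, category, method="majority", percent_threshold=0.5):
    """Check whether a category appears consistently across the cached days.

    daily_categories: list of category sets, one per cached day (up to N days).
    """
    days = len(daily_categories)
    if days == 0:
        return False  # no cached data; the caller falls back to flagging on any negative category
    hits = sum(1 for day in daily_categories if category in day)
    if method == "percent":
        return hits / days >= percent_threshold
    if method == "majority":
        return hits > days / 2
    if method == "unanimous":
        return hits == days
    if method == "in_one_or_more":
        return hits >= 1
    raise ValueError(f"unknown consensus method: {method}")

# Example: "Pseudoscience" appears on 2 of the last 3 cached days.
history = [
    {"Pseudoscience", "Health fraud companies"},
    {"Pseudoscience"},
    {"Companies based in California"},
]
category_has_consensus(history, "Pseudoscience", method="majority")   # True
category_has_consensus(history, "Pseudoscience", method="unanimous")  # False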
Negative sentiment is determined using a pre-trained sentence classifier available on Hugging Face.
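As an illustration, a sentiment check over category names could be built on the Hugging Face transformers pipeline; the snippet below uses the pipeline's default sentiment model, which is an assumption and may differ from the model the package actually uses:

from transformers import pipeline

# Default sentiment-analysis pipeline; the package's actual model may differ.
classifier = pipeline("sentiment-analysis")

categories = ["Pseudoscience", "Health fraud companies", "American news websites"]
results = classifier(categories)
negative = [c for c, r in zip(categories, results) if r["label"] == "NEGATIVE"]
print(negative)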
The consensus algorithm is intended to prevent spam or malicious edits on Wikipedia from compromising the integrity of the tool. For example, if a reputable site is temporarily given a negative sentiment category on Wikipedia (e.g. Pseudoscience), the consensus algorithm will prevent the site from being flagged as untrustworthy.
This only works if there is at least one day of data available in the cache. If there is no data available, the site will be flagged as untrustworthy if it has any negative sentiment categories.
To counter this, any extension using this tool should consider showing all categories to users directly, allowing people to make their own decisions about the trustworthiness of a site.
Limitations
Trust is considered at the level of the domain. Thus, using this tool one could determine that example.com is trustworthy, but not specifically example.com/example.
Trust is not analyzed at the subdomain level. Thus, if example.com is considered trustworthy, example.example.com would not have a ranking unless it were specifically logged in a database used by this tool and noted as untrustworthy.
This tool is not meant to be a substitute for analyzing source material to verify the veracity and reliability of the information in an article.
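Because trust is tracked per domain, callers should reduce a full URL to its host before calling generate_report. A minimal sketch of such normalization (this helper is not part of the package; it is an assumption for illustration):

from urllib.parse import urlparse

def to_domain(url: str) -> str:
    """Reduce a URL to a lowercase host name for lookup."""
    host = urlparse(url).netloc or url  # handle bare domains without a scheme
    return host.lower()

to_domain("https://example.com/example")  # "example.com"
to_domain("ABCNews.com.co")               # "abcnews.com.co"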
Example
Analysis of goop.com:
Website is flagged for the following reasons:
negative_sentiment_categories: Pseudoscience, Health fraud companies, Advertising and marketing controversies
Analysis of wordpress.com:
Website is not flagged.
Analysis of abcnews.com.co:
Website is flagged for the following reasons:
negative_sentiment_categories: Fake news websites, Defunct websites
known_problematic_websites: abcnews.com.co
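The analyses above can be reproduced with a small loop built on the Getting Started snippet (output formatting here is approximate):

from source_trust import generate_report

for domain in ("goop.com", "wordpress.com", "abcnews.com.co"):
    report = generate_report(domain)
    flags = {k: v for k, v in report.items() if k != "all_categories" and len(v) > 0}
    print(f"Analysis of {domain}:")
    if flags:
        print("Website is flagged for the following reasons:")
        for key, value in flags.items():
            print(key + ": " + ", ".join(value))
    else:
        print("Website is not flagged")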
Lists Consulted for Reliability Checks
See the KNOWN_LISTS variable in source_trust.py for a list of the lists consulted for reliability checks.
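For example, the configured lists could be printed at runtime (the import below assumes KNOWN_LISTS is exported from the installed source_trust module, which may not match the actual package layout):

# Import path assumed from the file name source_trust.py; adjust if needed.
from source_trust import KNOWN_LISTS

for entry in KNOWN_LISTS:
    print(entry)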
Contributing
Have an idea for how this project could be improved? Open an Issue on the project's GitHub repository. Want to contribute? Fork the project and make a pull request.
License
This project is licensed under an MIT license.