Python Library for Signal Exchange

python-threatexchange

A Python Library to simplify the exchange and use of trust & safety information, especially media hash exchanges. It also contains a CLI called threatexchange to demonstrate the functionality.

python-threatexchange is designed to be extensible and comes with a simple model of adding new functionality.

To get similar functionality in a deployable service, check out hasher-matcher-actioner.

Run the CLI in a Docker container

A Dockerfile is provided which allows you to run the CLI with minimal dependencies.

First build the container:

$ docker build --tag threatexchange .

Then run:

$ docker run threatexchange

To persist the configuration and data between invocations, mount the /var/lib/threatexchange volume:

$ docker run --volume $HOME/.threatexchange:/var/lib/threatexchange threatexchange

Installation

If you don't have pip, install it first by following the pip installation guide.

$ python3 -m pip install threatexchange --upgrade

Introduction

Trust and safety is a hard problem in general. One thing that makes it harder is that most platforms try to keep themselves safe on their own, even though bad actors and viral content spread from platform to platform. This duplicates effort twice over: each platform builds its own technical capability to detect harmful content, and each platform separately fights what may be the same copies of known harmful content.

One technique that allows platforms to combine their efforts is sharing signatures of content they have already detected, or the inputs to various trust and safety tools that can be used to find harmful content. The best-known examples are photo/video hash sharing programs like those operated by the National Center for Missing & Exploited Children (NCMEC), the Global Internet Forum to Counter Terrorism, Southwest Grid for Learning’s StopNCII.org, and Meta's ThreatExchange platform.

The python-threatexchange library aims to simplify the exchange of signals via platforms like those above. It also provides a baseline of functionality that simplifies testing and creating new exchanges and techniques, and provides cross-compatibility.

Philosophy of the Library

This library is maintained by a small team at Meta with a limited range of experience, so we will prioritize the use cases we are most familiar with. We believe that accessibility is a barrier for many platforms, so we will put as much as we can in the open. We also understand that it may not make sense to use only publicly visible approaches, and we welcome platform-specific modifications and derivatives. We also accept pull requests! If you think some functionality is widely applicable, or a bug is bothering you, please send one. If you think a larger change may be needed (such as adding an entirely new subcommand to the CLI), we would appreciate it if you reached out to talk through the feature before submitting it!

Key Concepts

Below is a quick overview of the key concepts. If you dig deeper into the library, there are additional considerations that might apply if you are creating your own extensions.

The basic flow of data through the system is:

Configure which sources of data (signals) you want to pull from (aka collaborations)
Download from all sources
Store the signals and build an efficient matching data structure (index)
Match content against stored signals
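
The four steps above can be sketched in plain Python. This is an illustrative toy, not the library's API: the names Collaboration, fetch_all, build_index, and match are hypothetical, the signal values are made up, and exact string equality stands in for a real matching technique.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Collaboration:
    """Hypothetical stand-in for one configured source of signals."""

    name: str
    signals: list[str] = field(default_factory=list)


def fetch_all(collabs: list[Collaboration]) -> dict[str, list[str]]:
    """Step 2: 'download' every signal from every configured source."""
    return {c.name: list(c.signals) for c in collabs}


def build_index(fetched: dict[str, list[str]]) -> dict[str, set[str]]:
    """Step 3: map each signal to the collaborations that shared it."""
    index: dict[str, set[str]] = {}
    for collab_name, signals in fetched.items():
        for signal in signals:
            index.setdefault(signal, set()).add(collab_name)
    return index


def match(index: dict[str, set[str]], signal: str) -> set[str]:
    """Step 4: return the collaborations whose data matches the signal."""
    return index.get(signal, set())


# Step 1: configure which sources to pull from (made-up sample data).
collabs = [
    Collaboration("Sample Signals", ["abc123", "def456"]),
    Collaboration("Local File", ["def456"]),
]
index = build_index(fetch_all(collabs))
print(sorted(match(index, "def456")))
```

Keeping the collaboration name attached to each stored signal is what lets a platform act differently depending on where a match came from.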

Collaborations

A collaboration represents a single collection of data from a single API. This often ties to practical usage such as "A1 video hashes from the NCMEC industry database" or "terrorism photo hashes from Meta's ThreatExchange only from specific applications". In cases where a platform may want to test or take different actions on matching data from one location, Collaborations provide a way to do so.

SignalType, Signals, Indices

A SignalType encapsulates a technique that can be used to classify or detect content, along with settings for that technique that can be shared between platforms.

A serialization of data that can be used as an input to detect/match content is called a "signal", and this library enforces that every signal be representable as a Python str.

SignalType enforces that you provide "naive" or brute force versions of the techniques that can be used for correctness testing. By default, python-threatexchange will use a simple linear scan against these brute force methods. If there are more efficient methods for scanning large datasets, the SignalTypeIndex interface provides a place to store a more complex scaled technique.
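
As a rough illustration of the brute-force approach described above, the sketch below pairs a naive pairwise comparison with a linear-scan index. Everything here is an assumption made for illustration: the names hamming_distance, compare, and LinearScanIndex are hypothetical, Hamming distance over hex strings is just one example technique, and the threshold of 2 bits is arbitrary. It does not reproduce the library's SignalType or SignalTypeIndex interfaces.

```python
from __future__ import annotations


def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Count differing bits between two equal-length hex strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("signals must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def compare(signal: str, candidate: str, threshold: int = 2) -> bool:
    """Naive pairwise comparison: match if within `threshold` bits."""
    return hamming_distance(signal, candidate) <= threshold


class LinearScanIndex:
    """Brute-force index: check the query against every stored signal."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, str]] = []

    def add(self, signal: str, label: str) -> None:
        self._entries.append((signal, label))

    def query(self, signal: str) -> list[str]:
        # O(n) per query; a scaled index would replace only this method.
        return [label for stored, label in self._entries if compare(signal, stored)]


index = LinearScanIndex()
index.add("ff00", "known-bad-1")
index.add("0f0f", "known-bad-2")
print(index.query("ff01"))  # one bit away from "ff00" only
```

Because the linear scan is easy to reason about, it can serve as the reference when testing a faster index: both should return the same matches for the same query.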

ContentType, Content

SignalTypes are usually not globally applicable, targeting only a specific type of content such as text, images, or URLs. Additionally, some types of content can be decomposed or processed to extract additional content. Take for example a URL to a post on a social media site with an embedded video hosted on a third site with a description and thumbnail.

URL: www.example.com/post/123
              |
              +-- Text: "Look at this cool video"
              |
              +-- Photo: <thumbnail preview>
              |
              +-- URL: www.content-host.com/321.mp4
                                 |
                                 + File: 321.mp4
                                           |
                                           + Video: <bytes>
                                                       |
                                                       + Images: <frame1>, <frame2>...
                                                       |             |
                                                       |             + Text: <from OCR>
                                                       |
                                                       + Audio: <bytes>
                                                                   |
                                                                   + Text: <computer generated transcript>

ContentType is a wrapper around traversing this graph and helps find out which techniques are applicable to a given input. It may make sense to create ContentTypes specific to your platform (such as a post type), or to represent specific combinations of signals. Certain imagery may only be harmful if accompanied by certain text and vice-versa.
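
One way to picture the traversal is a small tree walk that pairs each piece of content with the techniques that apply to it. This is a sketch under assumptions: ContentNode, TECHNIQUES, and applicable_techniques are hypothetical names, and the mapping of content type to technique is illustrative only (the technique names are borrowed from the CLI output shown elsewhere in this document).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ContentNode:
    """Hypothetical node: a piece of content plus content extracted from it."""

    content_type: str
    value: str
    children: list[ContentNode] = field(default_factory=list)


# Assumed mapping for illustration; types without an entry have no technique.
TECHNIQUES = {
    "text": ["raw_text"],
    "photo": ["pdq"],
    "video": ["video_md5"],
}


def applicable_techniques(node: ContentNode) -> list[tuple[str, str]]:
    """Walk the tree depth-first, pairing each node with its techniques."""
    found = [(node.value, t) for t in TECHNIQUES.get(node.content_type, [])]
    for child in node.children:
        found.extend(applicable_techniques(child))
    return found


post = ContentNode("url", "www.example.com/post/123", [
    ContentNode("text", "Look at this cool video"),
    ContentNode("photo", "<thumbnail preview>"),
    ContentNode("url", "www.content-host.com/321.mp4", [
        ContentNode("video", "<bytes>"),
    ]),
])
for value, technique in applicable_techniques(post):
    print(f"{technique}: {value}")
```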

SignalExchangeAPI, Updates, Checkpoints, FetchedMetadata, Storage

A SignalExchangeAPI is a location that allows for the exchange of Signals. It's not expected that every SignalExchangeAPI supports all signals, or that it is hosted by a third party: an API could just be a specific file on disk.

The interface defines how a full copy of signals for a single Collaboration can be fetched using sequential, checkpoint-able updates. It also must provide a solution for naive implementations of storage by merging a copy of the data in memory.

For some applications, the amount of data will be too large to fit in memory - in that case, a solution that can efficiently merge updates produced by the fetch() function is all that is needed.
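
The idea of sequential, checkpoint-able updates plus a naive in-memory merge can be sketched as follows. This is a toy under stated assumptions, not the SignalExchangeAPI interface: REMOTE_LOG, the integer checkpoint, the page size, and the convention that None means "deleted" are all invented for illustration.

```python
from __future__ import annotations

from typing import Iterator, Optional

# Hypothetical remote log of updates: (signal, metadata); None means "deleted".
REMOTE_LOG: list[tuple[str, Optional[str]]] = [
    ("aaa", "label-1"),
    ("bbb", "label-2"),
    ("aaa", None),
    ("ccc", "label-3"),
]


def fetch(
    checkpoint: int, page_size: int = 2
) -> Iterator[tuple[dict[str, Optional[str]], int]]:
    """Yield (delta, new_checkpoint) pairs, resuming from `checkpoint`."""
    position = checkpoint
    while position < len(REMOTE_LOG):
        page = REMOTE_LOG[position:position + page_size]
        position += len(page)
        yield dict(page), position


def naive_merge(state: dict[str, str], delta: dict[str, Optional[str]]) -> None:
    """In-memory merge: None removes a signal, anything else upserts it."""
    for signal, metadata in delta.items():
        if metadata is None:
            state.pop(signal, None)
        else:
            state[signal] = metadata


state: dict[str, str] = {}
checkpoint = 0
for delta, checkpoint in fetch(checkpoint):
    # A real store would persist the merged state and checkpoint together.
    naive_merge(state, delta)
print(state, checkpoint)
```

When the data no longer fits in memory, only naive_merge needs replacing (for example with writes to a database); the fetch loop and checkpoint handling can stay the same.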

Extensions

This library can make use of extensions provided by any party, public or private, as long as they conform to the conventions established in the library. Extensions are a way to prototype out new techniques, and quickly make them available in existing exchanges. Some exchanges, like ThreatExchange, allow sharing arbitrary data with arbitrary labels, and so a new technique can be rapidly demonstrated cross-platform even if not officially supported.

`threatexchange` CLI

The threatexchange CLI is designed to rapidly demonstrate the value of the library and, in a pinch, could be the basis for an end-to-end solution.

Usage

The CLI was designed for use with signal exchanges. The normal flow is roughly:

configure collaborations
fetch from APIs
build indices
match data
contribute labels and data

$ threatexchange --help  # The help should give a decent overview of functionalities

# Step 1: We can skip this step if using the sample data
$ threatexchange config collab edit ...

# Step 2: This will save progress, and we'll want to rerun it to get new data periodically
$ threatexchange fetch

# Step 3: This is done by default at the end of step 2, but you can also trigger it manually
$ threatexchange dataset --rebuild-indices

# Step 4: You can match a variety of content and formats
$ threatexchange match text -- 'bball now?'
raw_text - (Sample Signals) INVESTIGATION_SEED
trend_query - (Sample Signals) INVESTIGATION_SEED
# You can also debug matching by looking at what hashes are generated:
$ threatexchange hash video example.mp4
video_md5 f09791b743c21f26a189c33b798b8e46

# Step 5: Contribute labels 
$ threatexchange label ...

Viewing Signals

TODO

$ threatexchange dataset 
$ threatexchange dataset -P --csv > out.csv

Connecting to APIs and Getting Signals

A local file

This is the fastest way to experiment with the CLI's functionality and with saving content.

$ threatexchange hash photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true
pdq f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22
$ threatexchange hash photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true >> ~/file.txt
$ threatexchange config collab edit local_file --filename ~/file.txt 'file.txt' --create
$ threatexchange fetch
$ threatexchange match photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true
pdq - (file.txt) INVESTIGATION_SEED
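
If you want to consume the hash output from a script, a minimal parser might look like the sketch below. It assumes that each output line has the form "<signal type> <value>", as in the examples in this document; the function name parse_hash_output is hypothetical and the format is not guaranteed beyond what is shown here.

```python
from __future__ import annotations


def parse_hash_output(stdout: str) -> list[tuple[str, str]]:
    """Split each non-empty line into (signal_type, signal_value)."""
    pairs = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        # Everything before the first space is taken as the signal type.
        signal_type, _, value = line.partition(" ")
        pairs.append((signal_type, value))
    return pairs


sample = "video_md5 f09791b743c21f26a189c33b798b8e46\n"
print(parse_hash_output(sample))
```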

ThreatExchange

If you have access to Meta's ThreatExchange, you can use the library with PrivacyGroups with threat_updates enabled.

# Step 1 - configure the default credentials
$ threatexchange config api fb_threat_exchange --access-token '<TOKEN>'
# Step 1 Alternative 1 - TX_ACCESS_TOKEN Environment variable
$ TX_ACCESS_TOKEN='<TOKEN>'
$ export TX_ACCESS_TOKEN

# Step 1 Alternative 2 - ~/.txtoken file
$ touch ~/.txtoken
$ chmod 600 ~/.txtoken
$ echo '<TOKEN>' > ~/.txtoken

# Step 2 - import configuration
$ threatexchange config api fb_threat_exchange -L
1012185296055235 'Example Collaboration' ...
$ threatexchange config api fb_threat_exchange -I 1012185296055235

# Step 2 Alternative - manually configure via
$ threatexchange config collab edit fb_threat_exchange ...

$ threatexchange fetch

NCMEC Hash API

The National Center for Missing & Exploited Children (NCMEC) hosts a number of media hash exchanges related to Child Sexual Abuse Material (CSAM). If you have an account with NCMEC and credentials, you can download and use hashes from that API.

# Step 1 - configure the default credentials
$ threatexchange config api ncmec --credentials '<USER>' '<PASSWORD>'
# Step 1 Alternative 1 - TX_NCMEC_CREDENTIALS Environment variable
$ TX_NCMEC_CREDENTIALS='<TOKEN>'
$ export TX_NCMEC_CREDENTIALS

# Step 2 - set up config
# Example: NGO database only using esp=1
$ threatexchange config collab edit ncmec --create 'NCMEC NGO' --environment=NGO --only-esp 

$ threatexchange fetch

StopNCII.org

StopNCII.org allows people to upload hashes of intimate imagery/videos when someone is threatening to share them. If you are a partner with credentials, you can download and use hashes from that API.

# Step 1 - TX_STOPNCII_KEYS Environment variable - comma separated
$ TX_STOPNCII_KEYS='<FUNCTION_KEY>,<SUBSCRIPTION_KEY>'
$ export TX_STOPNCII_KEYS

# Step 1 Alternative - ~/.tx_stopncii_keys file
$ touch ~/.tx_stopncii_keys
$ chmod 600 ~/.tx_stopncii_keys
$ echo '<FUNCTION_KEY>,<SUBSCRIPTION_KEY>' > ~/.tx_stopncii_keys

# Step 2 - set up config
$ threatexchange config collab edit stop_ncii --create 'StopNCII' 

$ threatexchange fetch

Appendix

State

The CLI stores state in ~/.threatexchange. There are a few commands which will manipulate this directory, but if you need a factory reset, run rm -r ~/.threatexchange.

General Expectation for Compatibility and Versioning

We strive to provide a stable library for use in production systems. To that end, we will use version numbers to help platforms which are using the threatexchange libraries in their own codebase.

Public Interfaces:

Any API used in extensions (SignalType, ContentType, etc), including their names and paths.
- Implementations of those APIs in the library (e.g. PDQSignal), though excluding internal details of those implementations
CLI commands and flags
- CLI output format that might be part of a pipeline (ex: threatexchange dataset -P and threatexchange match stdout)
CLI state

Private Interfaces/Internal Details:

CLI command implementations
CLI Logging/stderr
Any CLI behavior marked as unstable, prototype, or draft in its --help

Major versions (1.X.X => 2.0.0) will have breaking changes:
1. Extensions (SignalType, ContentType, SignalExchangeAPI) may not be backwards compatible.
2. State (FetchedSignalMetadata, file formats) may not be compatible, but libraries or the CLI may attempt to migrate automatically if possible. Tooling to migrate state may also be available.
3. CLI: Storage formats and commands may all have changed.
4. Library: Files may be moved or renamed.
Minor versions (1.0.X => 1.1.X) may change public interfaces, but only in ways that are backwards compatible:
1. Extensions: May gain new methods, or have signatures with new arguments with defaults.
2. State: May be changed only if automatic migration is possible with how the CLI uses it (__setstate__ with pickle, TBD with dacite).
3. CLI: Flags may change behavior or move only if previous invocations will do the same thing (e.g. nargs could go from 1 to '*' or '+', or the flag can be renamed if a hidden alias is maintained).
4. Library: Files not in the public interface may be moved or renamed.
Revision numbers (1.0.0 => 1.0.1) will be fully backwards compatible.
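
The policy above can be turned into a small check that tells you what kind of change to expect between two versions. This is a sketch only: parse_version and classify_upgrade are hypothetical helpers, and they assume plain MAJOR.MINOR.REVISION version strings with no pre-release suffixes.

```python
from __future__ import annotations


def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.REVISION' into a tuple of integers."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected MAJOR.MINOR.REVISION, got {version!r}")
    major, minor, revision = (int(p) for p in parts)
    return major, minor, revision


def classify_upgrade(current: str, candidate: str) -> str:
    """Name the highest-order version component that differs."""
    cur, new = parse_version(current), parse_version(candidate)
    if new[0] != cur[0]:
        return "major: expect breaking changes"
    if new[1] != cur[1]:
        return "minor: public interfaces change only in backwards-compatible ways"
    if new[2] != cur[2]:
        return "revision: fully backwards compatible"
    return "same version"


print(classify_upgrade("1.0.15", "1.1.0"))
print(classify_upgrade("1.0.14", "1.0.15"))
print(classify_upgrade("0.0.29", "1.0.0"))
```

Under this policy, a dependency pin that allows minor and revision updates but excludes the next major version (for example threatexchange>=1.1.0,<2.0.0) would be a reasonable default, though that is a suggestion rather than something this document prescribes.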

The CLI as an E2E Solution

While hasher-matcher-actioner is this repository's attempt at a scaled end-to-end solution, the CLI uses the same libraries and can emulate the same functionality.

In order to do that, you'll need to solve a few problems:

Storing and potentially distributing config files
Calling threatexchange fetch periodically
Distributing the produced indices
Connecting your content pipeline to threatexchange match from those indices
Routing matches to your own tooling and infrastructure.

Unless you are doing the above on a single machine, your favorite distributed filesystem may handle most of these problems (for example, syncing a single shared ~/.threatexchange directory).
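
On a single machine, the periodic fetch and the match step can be glued together with a thin wrapper around the CLI. The sketch below is an assumption-heavy starting point, not a supported integration: the helper names and the one-hour interval are hypothetical, and only the threatexchange fetch and threatexchange match text invocations are taken from the usage examples in this document. Nothing runs on import; the functions only build and execute commands when called.

```python
from __future__ import annotations

import subprocess
import time


def fetch_command() -> list[str]:
    """Command line for pulling new data from all configured collaborations."""
    return ["threatexchange", "fetch"]


def match_text_command(text: str) -> list[str]:
    """Command line for matching a piece of text; '--' ends option parsing."""
    return ["threatexchange", "match", "text", "--", text]


def run(command: list[str]) -> str:
    """Run a CLI command and return its stdout, raising on a non-zero exit."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def fetch_forever(interval_seconds: float = 3600.0) -> None:
    """Call `threatexchange fetch` periodically (the interval is an assumption)."""
    while True:
        run(fetch_command())
        time.sleep(interval_seconds)


def match_text(text: str) -> list[str]:
    """Return the non-empty stdout lines from a text match, one per match."""
    return [line for line in run(match_text_command(text)).splitlines() if line]
```

A scheduler such as cron could replace fetch_forever, and match_text is where you would route results to your own tooling.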

Release history

Version	Release date
1.1.0 (this version)	Dec 11, 2023
1.0.15	Oct 4, 2023
1.0.14	Oct 4, 2023
1.0.13	Sep 29, 2023
1.0.12	Sep 20, 2023
1.0.11	Sep 8, 2023
1.0.10	Feb 8, 2023
1.0.9	Dec 13, 2022
1.0.8	Dec 7, 2022
1.0.7	Nov 17, 2022
1.0.6	Oct 18, 2022
1.0.5	Oct 12, 2022
1.0.4	Oct 10, 2022
1.0.3	Sep 13, 2022
1.0.2	Aug 11, 2022
1.0.1	Aug 11, 2022
1.0.0	Aug 10, 2022
0.0.29	Dec 8, 2021
0.0.28	Nov 29, 2021
0.0.27	Oct 18, 2021
0.0.26	Oct 13, 2021
0.0.25	Oct 5, 2021
0.0.24	Sep 29, 2021
0.0.23	Jul 27, 2021
0.0.22	Jul 21, 2021
0.0.21	Jul 20, 2021
0.0.20	May 18, 2021
0.0.19	May 7, 2021
0.0.18	Apr 20, 2021
0.0.17	Apr 16, 2021
0.0.16	Mar 19, 2021
0.0.15	Mar 12, 2021
0.0.14	Mar 2, 2021
0.0.13	Feb 27, 2021
0.0.12	Feb 4, 2021
0.0.11	Dec 16, 2020
0.0.10	Dec 4, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

threatexchange-1.1.0.tar.gz (132.0 kB)

Uploaded Dec 11, 2023 Source

Built Distribution

threatexchange-1.1.0-py3-none-any.whl (170.6 kB)

Uploaded Dec 11, 2023 Python 3

Hashes for threatexchange-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c7b09783c3ef2690ca4b4fecb3ef453492b753f101b40bc0f2314f5fe75a06fd`
MD5	`2366f85840ef912a30a0099768536173`
BLAKE2b-256	`5f9e2983d6c27829a30cbf45d970efe4e5d2849d491bc45385d943c407f72d45`

Hashes for threatexchange-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`206d619fa6d1d6f2fcc2f8b6bcbb1bb2a5af83ff121478bb6fecf5c624a76332`
MD5	`231329a61ed19857f1cebfc4630f0d8e`
BLAKE2b-256	`b9b61a641aa4e0ee97878c14b0e4e44c09dbb96970a30c0cd219f598fbb7026c`
