A package providing tools for the SKA Science Data Challenges.
Science Data Challenge Scoring Code API
The SKA Science Data Challenge #1 (https://astronomers.skatelescope.org/ska-science-data-challenge-1/) tasked participants with identifying and classifying sources in synthetic radio images.
In addition to the synthetic images, participants were provided with a section of the 'truth catalogue' of sources used to generate the artificial data. Comparing the truth catalogue with the 'submission catalogue' produced by a participant's solution would provide a means of determining the success of the solution.
To evaluate the accuracy of the results, a program was developed to cross-match sources between the submission and truth catalogues, and calculate a 'score' based on the result of this cross-match.
This is an open-source implementation of the program used to score and rank the submissions for the first SKA Science Data Challenge (SDC). A number of improvements have been made, most notably the use of a more performant cross-match algorithm. As such it is not possible to make a direct comparison between the scores produced by this package and the original program. The original IDL code is available at: https://astronomers.skatelescope.org/ska-science-data-challenge-1/
SDC1: Scoring a submission
Scoring a submission for SDC1 is provided through the Sdc1Scorer class. Typically, this is instantiated with two pandas.DataFrame objects (corresponding to the submission and truth catalogues) and the corresponding image frequency (560, 1400 or 9200 MHz). A method is also available to construct an Sdc1Scorer instance from the paths to the two catalogue files.
Once the Sdc1Scorer has been instantiated, its run method will run the scoring pipeline and evaluate a result. The result is provided via a Sdc1Score object, which has properties providing feedback about the submission catalogue.
Below are several examples which illustrate the API usage.
Catalogue schema
The catalogues must conform to the schema specified in the competition rules; as a summary, the expected columns for the DataFrames are:
CAT_COLUMNS = [
    "id",
    "ra_core",
    "dec_core",
    "ra_cent",
    "dec_cent",
    "flux",
    "core_frac",
    "b_maj",
    "b_min",
    "pa",
    "size",
    "class",
]
These are the column names which will be applied if reading the catalogues from file. Catalogue files should be space-delimited tables; header rows can be skipped using the skiprows arguments of the Sdc1Scorer.from_txt() method.
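For reference, a catalogue file conforming to this schema could be read into a DataFrame manually with pandas. A minimal sketch (the Sdc1Scorer.from_txt method shown in Example 1 handles this for you, so this is purely illustrative):

import pandas as pd

# Space-delimited table with the column schema above; skiprows=1
# drops a single header row (adjust to match your file).
cat_df = pd.read_csv(
    "/path/to/catalogue.txt",
    names=CAT_COLUMNS,
    sep=r"\s+",
    skiprows=1,
)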
Example 1: Scoring catalogue files
from ska_sdc import Sdc1Scorer

sub_cat_path = "/path/to/submission/catalogue.txt"
truth_cat_path = "/path/to/truth/catalogue.txt"

scorer = Sdc1Scorer.from_txt(
    sub_cat_path,
    truth_cat_path,
    freq=1400,
    sub_skiprows=1,
    truth_skiprows=0,
)
scorer.run()
print("Final score: {}".format(scorer.score.value))
Note the optional sub_skiprows and truth_skiprows keyword arguments, which specify the number of rows to be skipped when reading each file (e.g. header rows).
Example 2: Scoring catalogue DataFrames
If the catalogues are already DataFrame objects, the scorer can be instantiated from these directly as follows:
from ska_sdc import Sdc1Scorer
scorer = Sdc1Scorer(
    sub_cat_df,
    truth_cat_df,
    freq=1400,
)
scorer.run()
Example 3: Using Sdc1Scorer.run optional arguments
scorer.run(
    mode=0,       # 0, 1 for core, centroid position modes respectively
    train=False,  # True to score based on training area only, else exclude
    detail=False, # True to return per-source scores and match catalogue
)
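If detail=True is passed, the per-source DataFrames described under the Sdc1Score properties below can be inspected after the run. A brief sketch, assuming the scorer was constructed as in the examples above:

scorer.run(detail=True)
score = scorer.score

# Per-source property scores and the crossmatched catalogue:
print(score.scores_df.head())
print(score.match_df.head())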
Sdc1Score properties
The Sdc1Score object has a range of properties in addition to the score value, as follows:
- value: numerical score value (score_det - n_false)
- n_det: total number of detected sources in submission
- n_match: number of detections that were matched to truth sources
- n_bad: number of matched detections that failed to meet acceptance threshold
- n_false: number of detections that were not matched to truth sources
- score_det: total score for all matched sources
- acc_pc: accuracy percentage for matched sources
- scores_df: DataFrame of individual source scores for each property
- match_df: DataFrame of matched sources with corresponding truth sources
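As a quick illustration of how these fit together (assuming scorer.run() has completed as in the examples above; value is defined as score_det - n_false):

score = scorer.score
print("Detections: {}, matched: {}, false: {}".format(
    score.n_det, score.n_match, score.n_false))
print("Score: {} = {} - {}".format(
    score.value, score.score_det, score.n_false))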
SDC1 scoring pipeline description
This is a brief overview of the stages of the scoring pipeline.
Stage 1: prep
- A new column corresponding to log(flux) is created for each catalogue DataFrame.
- The area corresponding to the training dataset is removed from each catalogue, unless train=True is passed to Sdc1Scorer.run, in which case only the training area is selected (a sketch of these first two steps follows this list).
- Additional features required by the catalogue cross-match step are calculated. The first such feature is the primary beam correction factor, which accounts for off-axis sources being apparently fainter than sources closer to the beam centre. In addition to this, the convolved size property estimates the apparent detected source size; this is significant for small/point-like sources, where otherwise a small positional error could mean genuine matches are spuriously ignored.
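A minimal sketch of the first two prep steps, assuming a catalogue DataFrame with the schema above; the log_flux column name and the training-area bounds are illustrative placeholders, not the values used by the package:

import numpy as np

# Placeholder training-area bounds (illustrative only; the real
# pipeline uses the training region defined by the challenge).
TRAIN_DEC_MIN, TRAIN_DEC_MAX = -30.0, -29.5

def prep_catalogue(cat_df, train=False):
    cat_df = cat_df.copy()
    # Work with fluxes in log space for later comparisons.
    cat_df["log_flux"] = np.log10(cat_df["flux"])
    # Keep only the training area if train=True, otherwise exclude it.
    in_train = cat_df["dec_cent"].between(TRAIN_DEC_MIN, TRAIN_DEC_MAX)
    return cat_df[in_train] if train else cat_df[~in_train]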
Stage 2: crossmatch
- A positional crossmatch is performed using a k-dimensional tree space partitioning structure. All truth catalogue sources within a radius of each submitted source's convolved size are identified as candidate matches.
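A hedged sketch of this kind of k-d tree lookup using scipy, assuming sub_df and truth_df are the prepped catalogues; the planar treatment of RA/Dec and the conv_size column name are simplifying assumptions, not the package's actual implementation:

import numpy as np
from scipy.spatial import cKDTree

# Build the tree on truth-catalogue positions. A real pipeline must
# handle spherical geometry (e.g. cos(dec) scaling of RA offsets).
truth_xy = np.column_stack([truth_df["ra_cent"], truth_df["dec_cent"]])
tree = cKDTree(truth_xy)

# For each submitted source, gather all truth sources within a radius
# given by its convolved size (in the same units as the coordinates).
sub_xy = np.column_stack([sub_df["ra_cent"], sub_df["dec_cent"]])
candidates = [
    tree.query_ball_point(xy, r)
    for xy, r in zip(sub_xy, sub_df["conv_size"])
]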
Stage 3: sieve
- For each source's candidate matches, select the best by considering the difference in flux and source size.
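One way such a sieve could look, operating on the candidate lists from the previous stage; the equal weighting of flux and size mismatches is purely illustrative, and the package defines its own criteria:

import math

def sieve(sub_row, candidate_truth_rows):
    # Choose the candidate minimising a combined flux/size mismatch.
    def mismatch(truth_row):
        d_flux = abs(math.log10(sub_row["flux"]) - math.log10(truth_row["flux"]))
        d_size = abs(sub_row["b_maj"] - truth_row["b_maj"])
        return d_flux + d_size  # equal weighting: an arbitrary choice here
    return min(candidate_truth_rows, key=mismatch)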
Stage 4: create_score
- Reject (but count) all matches that lie more than 5 sigma from the corresponding truth source (when considering position, flux and size).
- For each matched source, calculate the accuracy of the measured properties, and from these generate a total score. Each matched source can contribute up to 1.0 to the total score. Penalise incorrectly identified sources by subtracting the number of unmatched sources from the total score; this yields the final score.
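A toy illustration of the per-source arithmetic (the property names and the simple averaging are assumptions for illustration; the real weights are defined by the package):

# Per-source accuracies in [0, 1] for each scored property.
accuracies = {"position": 0.98, "flux": 0.90, "size": 0.85}
source_score = sum(accuracies.values()) / len(accuracies)  # <= 1.0

# Summing these over all matched sources gives score_det; subtracting
# the number of unmatched detections (n_false) yields the final value.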