
DigCNV: Discriminating true CNVs from genotyping artifacts without further visualisation

Project description

DigCNV

DigCNV logo



A machine learning model to quality control genotyped CNVs

False CNVs add noise to analyses and can distort a diagnosis. CNV calling produces both false-negative and false-positive identifications. To limit false negatives, it is recommended to run multiple CNV callers at a time. We therefore present a statistical approach to clean CNV results coming from two calling algorithms, PennCNV and QuantiSNP.

This machine learning model can be used in two different ways:

  • a one-line script that annotates any new CNVs with our pre-trained model. The model was trained on 38,000 CNVs coming from 7 different genotyping technologies and reaches an AUC > 90% for most technologies (whether already seen during training or new).
  • a step-by-step approach to train and test the machine learning model on your own visualized dataset and perform some statistics.

DigCNV models

  • Most up-to-date model available at TBD

Classify CNVs with one line

# Run this only once to install the package
pip install digcnv

python3 -m digcnv [-v] <Path to config file>

Config file

Example of the config file needed for the one-line execution. An example can be downloaded with the function getConfigFileExample(output_path); a short sketch follows the config example below.

[Inputs]
pc_output_path = Path to the PennCNV output file
pc_qc_path = Path to the PennCNV microarray quality file 
qs_output_path = Path to the QuantiSNP output file


[Output]
Save_to_file = True
Output_path = /home/thomas/Documents/scripts/DigCNV/temp_data/DigCNV_pred.tsv

[DigCNV]
model_path": Path of the downloaded model. Available at : 

Run DigCNV with more options

Prepare data

from digcnv import CNVision, dataPreparation

# Import and merge CNVs coming from two CNV calling algorithms (only PennCNV and QuantiSNP in this version)
cnvs = CNVision.mergeMultipleCNVCallingOutputs("<list of PennCNV and QuantiSNP output pathways>", ["PennCNV", "QuantiSNP"])

# Add microarray quality data to the list of CNVs
cnvs = dataPreparation.addMicroArrayQualityData(cnvs, "<path to the PennCNV microarray quality file>")

# Compute derived features and add it to the CNV list
cnvs = dataPreparation.addDerivedFeatures(cnvs)

# Add chromosomal annotations such as centromere and segmental duplication overlaps
cnvs = dataPreparation.addChromosomicAnnotation(cnvs, "<path to the centromeres annotation file>", "<path to the segmental duplications annotation file>")

# Transform features coming from the two calling algorithms into the combined predictors used by the model
cnvs = dataPreparation.transformTwoAlgsFeatures(cnvs)
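At this point cnvs holds one row per CNV call with the merged and derived features. A quick sanity check, assuming the returned object is a pandas DataFrame (as the column assignments later on this page suggest):

# Inspect the prepared CNV table (assumed to be a pandas DataFrame).
print(cnvs.shape)             # number of CNVs x number of features
print(cnvs.columns.tolist())  # available predictors after preparation
print(cnvs.head())            # first few CNV calls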
 

Run DigCNV from the pre-trained model

More information in "article--WIP"

from os.path import join, split

from digcnv import digCnvModel, dataVerif

# Create an empty DigCNV model
model = digCnvModel.DigCnvModel()

# Open pre-trained model and update object
model_path = join(split(__file__)[0], 'data', 'DigCNV_model_multiple_technos.pkl')
model.openPreTrainedDigCnvModel(model_path)

# Check if mandatory columns for the DigCNV model exist and have right formats
dataVerif.checkIfMandatoryColumnsExist(cnvs, post_data_preparation=True)
dataVerif.checkColumnsformats(cnvs, post_data_preparation=True)

# Optional: plot a correlation heatmap between the different predictors used in the model
dataVerif.plotCorrelationHeatMap(cnvs, list_dim=model._dimensions, output_path="<path to the output plot (.pdf or .png)>")

# Check for missing values in the mandatory columns and split the data into two dataframes: one with CNVs having all information available
# and one with CNVs having at least one missing value (these can't be used for prediction)
cnvs, cnvs_with_na = dataVerif.computeNaPercentage(cnvs, dimensions=model._dimensions, remove_na_data=True)

# Discriminate true from false CNVs among the calls with complete data and return the predicted class for each
predicted_cnvs = model.predictCnvClasses(cnvs)
cnvs["DigCNVpred"] = predicted_cnvs

Train your own DigCNV model

from digcnv import digCnvModel, DigCnvPreProcessing, dataVerif

# Uses the CNVs created in the "Prepare data" section above

# Add your own annotation to your CNVs (0 for false CNVs and 1 for true)
cnvs["visualized_class"] = ["<classes of each CNV visualized>"]

# Dimensions used to classify CNVs (Recommended dimensions you can use your own)
predictors = ["",""]



# Remove CNVs with at least one missing value in the used predictors or in the visualized class column
cnvs, removed = dataVerif.computeNaPercentage(cnvs, dimensions=predictors + ["visualized_class"], remove_na_data=True)


# Split the dataset into a training set and a testing set (70% - 30%)
X_train, y_train, X_test, y_test = DigCnvPreProcessing.createTrainingTestingDatasets(cnvs, dimensions=predictors, X_dimension="visualized_class")

# If the ratio between the two classes is too unbalanced, uniformize class sizes by splitting the majority class and adding new pseudo CNVs to the minority class
X_train, y_train = DigCnvPreProcessing.uniformizeClassesSizes(X_train, y_train, 17, 0.4, 0.5)

# Create a DigCNV model
model = digCnvModel.DigCnvModel()
model.createDigCnvClassifier()

# Train the DigCNV model with the given training dataset
model.trainDigCnvModel(training_data=X_train, training_cat=y_train)

# Save the trained model into the specified path
model.saveDigCnvModelToPkl("<output_path>")
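Before using the model, it is worth evaluating it on the held-out test split; a minimal sketch with scikit-learn metrics, assuming predictCnvClasses accepts the test predictors the same way it accepts the prepared CNVs above:

from sklearn.metrics import accuracy_score, classification_report

# Predict classes for the held-out test CNVs (X_test must contain the model's predictor columns).
y_pred = model.predictCnvClasses(X_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))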

Input file format examples

PennCNV output

chr20:44356194-44378577       numsnp=7      length=22,384      state2,cn=1 /path/to/finalreport/10001 startsnp=rs232258 endsnp=rs380421 conf=16.163
chr9:17583310-17622213        numsnp=21     length=38,904      state5,cn=3 /path/to/finalreport/10001 startsnp=rs1028594 endsnp=rs3808750 conf=101.052
chr10:47543322-47703613       numsnp=47     length=160,292     state5,cn=3 /path/to/finalreport/10001 startsnp=rs11259779 endsnp=rs4128664 conf=156.227
chr6:4263349-4472587          numsnp=69     length=209,239     state2,cn=1 /path/to/finalreport/10002 startsnp=rs6937085 endsnp=rs7746329 conf=120.225
chr6:80608294-80611616        numsnp=6      length=3,323       state2,cn=1 /path/to/finalreport/10002 startsnp=rs17833835 endsnp=rs1887571 conf=20.441
...

PennCNV quality output

SampleID        LRR_mean        LRR_median      LRR_SD  BAF_mean        BAF_median      BAF_SD  BAF_DRIFT       WF      GCWF
10001   -0.0045 0.0000  0.1474  0.5028  0.5000  0.0268  0.000036        -0.0141 -0.0002
10002   0.0056  0.0000  0.1588  0.5027  0.5000  0.0259  0.000000        0.0194  0.0018
10003   -0.0090 0.0000  0.1564  0.5031  0.5000  0.0308  0.000000        0.0132  0.0053
10004   0.0014  0.0000  0.1494  0.5032  0.5000  0.0280  0.000000        0.0184  0.0018
10005   0.0031  0.0000  0.1412  0.5025  0.5000  0.0272  0.000000        0.0170  0.0001
...

QuantiSNP output

Sample Name     Chromosome      Start Position (bp)     End Position (bp)       Start Probe ID  End Probe ID    Length (bp)     No. Probes      Copy Number     Max. Log BF     Log BF: State 0 Log BF: State 1 Log BF: State 2 Log BF: State 3     Log BF: State 4 Log BF: State 5 Log BF: State 6
10001   1              31943355        31943355 rs7545865       rs7545865                     1 1       1       1.2152  -24.0258        -2.5507 0       1.2152  -16.1517        -19.4123        -25.013
10001   1             111930916       111934304 rs12077338      rs4839132                  3389 4       1       4.57286 -42.9409        4.57286 0       2.93527 -29.2398        -35.802 -44.9483
10001   2              44082362        44096010 rs6718187       rs6752551                 13649 4       3       0.954579        -62.543 -41.1676        0       -14.003 0.954579        0.576184        -60.5049
10001   2              92308395        92308395 rs4509760       rs4509760                     1 1       1       0.0218224       -26.3579        -4.38718        0       0.0218224       -19.0042        -21.9322        -27.2609
10001   3              59820539        59821071 rs1905866       rs17362486                  533 2       4       0.33986 -9.54639        -22.8223        0       -3.94108        -6.15468        0.33986 -8.45724
...
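The PennCNV quality file and the QuantiSNP output shown above are plain tab-separated tables with a single header line, so they can be inspected directly; a minimal sketch with pandas (the raw PennCNV CNV output itself is parsed for you by CNVision.mergeMultipleCNVCallingOutputs):

import pandas as pd

# PennCNV per-sample quality metrics (tab-separated, one header line).
quality = pd.read_csv("<path to the PennCNV microarray quality file>", sep="\t")

# QuantiSNP CNV calls (tab-separated, one header line).
quantisnp = pd.read_csv("<path to the QuantiSNP output file>", sep="\t")

print(quality.head())
print(quantisnp.head())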

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

digcnv-0.5.2.tar.gz (179.0 kB)

Uploaded Source

Built Distribution

digcnv-0.5.2-py3-none-any.whl (163.8 kB)

Uploaded Python 3

File details

Details for the file digcnv-0.5.2.tar.gz.

File metadata

  • Download URL: digcnv-0.5.2.tar.gz
  • Upload date:
  • Size: 179.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for digcnv-0.5.2.tar.gz
Algorithm Hash digest
SHA256 96932a36d60a4dbb49246456aa43b16ff581806c93b46fd43c2951d1ea04ac13
MD5 61b993b1307e853ee6b1f65cb3e03fb4
BLAKE2b-256 0bbf194eade9fb1dffbcbba5fd4e35b17632ff0a33b40d087549f678643c90ec

See more details on using hashes here.
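To check the integrity of a downloaded file against the digest listed above, a minimal sketch with the Python standard library (file name and expected SHA256 taken from this page):

import hashlib

# Expected SHA256 for digcnv-0.5.2.tar.gz, as listed above.
expected = "96932a36d60a4dbb49246456aa43b16ff581806c93b46fd43c2951d1ea04ac13"

with open("digcnv-0.5.2.tar.gz", "rb") as fh:
    digest = hashlib.sha256(fh.read()).hexdigest()

print("OK" if digest == expected else "Hash mismatch!")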

File details

Details for the file digcnv-0.5.2-py3-none-any.whl.

File metadata

  • Download URL: digcnv-0.5.2-py3-none-any.whl
  • Upload date:
  • Size: 163.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for digcnv-0.5.2-py3-none-any.whl
Algorithm Hash digest
SHA256 61e63867f57be3b59250681b3f8d9d31d291cfb9f2ba474da2505e2072af22f4
MD5 99fdcfd56e5a07cbd4d161da1eaae089
BLAKE2b-256 f8332d53cdc81fefd393d109043ab8c4c8b521245ad9e9372ded410e282fbb7f

See more details on using hashes here.
