Collapsed Haplotype Pattern Method for Linkage Analysis of Next-Generation Sequencing Data
Project description
SEQLinkage
Collapsed Haplotype Pattern Method for Linkage Analysis of Next-Generation Sequencing Data
Features
- It can do linkage analysis on single variants and CHP markers.
- It can deal with families from different population.
- It can handle large-scale whole-genome linkage analysis.
Pre-requisites
Make sure you install the pre-requisited before running seqlink:
#install cstatgen
conda install -c conda-forge xeus-cling
conda install -c anaconda swig
conda install -c conda-forge gsl
pip install egglib
git clone https://github.com/statgenetics/cstatgen.git
cd cstatgen
python setup.py install
#install paramlink2
R
install.packages("paramlink2")
Install
pip install SEQLinkage
How to use
!seqlink --help
usage: seqlink [-h] [--chp] --fam FILE --vcf FILE [--anno FILE] [--pop FILE]
[--included-vars FILE] [-b FILE] [-c P] [-o Name]
[--build STRING] [--freq INFO] [--chrom-prefix STRING]
[--run-linkage] [-K FLOAT] [--moi STRING] [-W FLOAT] [-M FLOAT]
[--theta-max FLOAT] [--theta-inc FLOAT]
SEQLinkage V2, linkage analysis using sequence data
options:
-h, --help show this help message and exit
Collapsed haplotype pattern method arguments:
--chp Generate CHP markers.
--fam FILE Input pedigree and phenotype information in FAM
format.
--vcf FILE Input VCF file, bgzipped.
--anno FILE Input annotation file from annovar.
--pop FILE Input two columns file, first column is family ID,
second column population information.
--included-vars FILE Variants to be included for linkage analysis, if None,
the analysis won't filter any variants. But you can
still set AF cutoff by -c argment.
-b FILE, --blueprint FILE
Blueprint file that defines regional marker (format:
"chr startpos endpos name avg.distance male.distance
female.distance").
-c P, --maf-cutoff P MAF cutoff to define variants to be excluded from
analyses. this is useful, if you need to analyse
multiple population together.
-o Name, --output Name
Output name prefix.
--build STRING Reference genome version for VCF file.
--freq INFO Info field name for allele frequency in VCF file.
--chrom-prefix STRING
Prefix to chromosome name in VCF file if applicable,
e.g. "chr".
LINKAGE options:
--run-linkage Perform Linkage analysis.
-K FLOAT, --prevalence FLOAT
Disease prevalence. Default to 0.001.
--moi STRING Mode of inheritance, AD/AR: autosomal
dominant/recessive. Default to AD.
-W FLOAT, --wt-pen FLOAT
Penetrance for wild type. Default to 0.01.
-M FLOAT, --mut-pen FLOAT
Penetrance for mutation. Default to 0.9.
--theta-max FLOAT Theta upper bound. Default to 0.5.
--theta-inc FLOAT Theta increment. Default to 0.05.
Linkage analysis on specific regions
Normally, the regions are gene regions. you can also use self-defined regions, such as promoter regions, enhancer regions.
1.run seqlink on CHP marker
!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --anno testdata/test_chr1_anno.csv --pop testdata/test_fam_pop.txt --blueprint testdata/test_blueprint_ext.txt -c 0.05 -o data/test_chp --chp --run-linkage
[1;40;32mMESSAGE: Binary trait detected in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam][0m
[1;40;32mMESSAGE: Namespace(chp=True, tfam='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam', vcf='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz', anno='testdata/test_chr1_anno.csv', pop='testdata/test_fam_pop.txt', included_vars=None, blueprint='testdata/test_blueprint_ext.txt', maf_cutoff=0.05, output='data/test_chp', build='hg38', freq='AF', chr_prefix=None, run_linkage=True, prevalence=0.001, inherit_mode='AD', wild_pen=0.01, muta_pen=0.9, theta_max=0.5, theta_inc=0.05)[0m
[1;40;32mMESSAGE: 18 samples found in FAM file but not in VCF file:[0m
[1;40;32mMESSAGE: 18 samples found in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz][0m
[1;40;32mMESSAGE: Loading marker map from [testdata/test_blueprint_ext.txt] ...[0m
[1;40;32mMESSAGE: 6 families with a total of 18 samples will be scanned for 12 pre-defined units[0m
SNVHap MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1
[1;40;32mMESSAGE: write to pickle: data/test_chp/chr1result/chr1result0.pickle,Gene number:4,Time:0.00012452216405007575[0m
create data/test_chp/chr1result/chr1result0_AFcutoff0.05_linkage.input
create data/test_chp/chr1result/chr1result0_AFcutoff0.05_linkage.lods
0.3767304867506027
create data/test_chp/chr1result/chr1result0_AFcutoff0.05_linkage.besthlod
[1;40;32mMESSAGE: ============= Finish analysis ==============[0m
2.run seqlink on variants
!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --anno testdata/test_chr1_anno.csv --pop testdata/test_fam_pop.txt --blueprint testdata/test_blueprint_ext.txt -c 0.05 -o data/test_var --run-linkage
[1;40;32mMESSAGE: Binary trait detected in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam][0m
[1;40;32mMESSAGE: Namespace(chp=False, tfam='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam', vcf='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz', anno='testdata/test_chr1_anno.csv', pop='testdata/test_fam_pop.txt', included_vars=None, blueprint='testdata/test_blueprint_ext.txt', maf_cutoff=0.05, output='data/test_var', build='hg38', freq='AF', chr_prefix=None, run_linkage=True, prevalence=0.001, inherit_mode='AD', wild_pen=0.01, muta_pen=0.9, theta_max=0.5, theta_inc=0.05)[0m
[1;40;32mMESSAGE: 18 samples found in FAM file but not in VCF file:[0m
[1;40;32mMESSAGE: 18 samples found in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz][0m
[1;40;32mMESSAGE: Loading marker map from [testdata/test_blueprint_ext.txt] ...[0m
[1;40;32mMESSAGE: 6 families with a total of 18 samples will be scanned for 12 pre-defined units[0m
[1;40;32mMESSAGE: write to pickle: data/test_var/chr1result/chr1result0.pickle,Gene number:4,Time:3.229235919813315e-05[0m
create data/test_var/chr1result/chr1result0_AFcutoff0.05_linkage.input
create data/test_var/chr1result/chr1result0_AFcutoff0.05_linkage.lods
0.6164871137589216
create data/test_var/chr1result/chr1result0_AFcutoff0.05_linkage.besthlod
[1;40;32mMESSAGE: ============= Finish analysis ==============[0m
No annotation
If you don't have the annotation file. there is no need to add
--pop
. And--freq
should be setted based on theINFO
column in vcf file.
!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --freq='AF' --blueprint testdata/test_blueprint_ext.txt -c 0.05 -o data/test_chp_na --chp --run-linkage
[1;40;32mMESSAGE: Binary trait detected in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam][0m
[1;40;32mMESSAGE: Namespace(chp=True, tfam='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam', vcf='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz', anno=None, pop=None, included_vars=None, blueprint='testdata/test_blueprint_ext.txt', maf_cutoff=0.05, output='data/test_chp_na', build='hg38', freq='AF', chr_prefix=None, run_linkage=True, prevalence=0.001, inherit_mode='AD', wild_pen=0.01, muta_pen=0.9, theta_max=0.5, theta_inc=0.05)[0m
[1;40;32mMESSAGE: 18 samples found in FAM file but not in VCF file:[0m
[1;40;32mMESSAGE: 18 samples found in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz][0m
[1;40;32mMESSAGE: Loading marker map from [testdata/test_blueprint_ext.txt] ...[0m
[1;40;32mMESSAGE: 6 families with a total of 18 samples will be scanned for 12 pre-defined units[0m
SNVHap MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1
[1;40;32mMESSAGE: write to pickle: data/test_chp_na/chrallresult/chrallresult0.pickle,Gene number:4,Time:6.505574979301956e-05[0m
create data/test_chp_na/chrallresult/chrallresult0_AFcutoff0.05_linkage.input
create data/test_chp_na/chrallresult/chrallresult0_AFcutoff0.05_linkage.lods
0.3394373469054699
create data/test_chp_na/chrallresult/chrallresult0_AFcutoff0.05_linkage.besthlod
[1;40;32mMESSAGE: ============= Finish analysis ==============[0m
Whole-genome linkage analysis
if
--blueprint
is not provided, the genomic region will be seperated to pseudogenes with 1000 variants.
!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --anno testdata/test_chr1_anno.csv --pop testdata/test_fam_pop.txt -c 0.05 -o data/test_chp_wg --chp --run-linkage
[1;40;32mMESSAGE: Binary trait detected in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam][0m
[1;40;32mMESSAGE: Generate regions by annotation[0m
[1;40;32mMESSAGE: Namespace(chp=True, tfam='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam', vcf='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz', anno='testdata/test_chr1_anno.csv', pop='testdata/test_fam_pop.txt', included_vars=None, blueprint=None, maf_cutoff=0.05, output='data/test_chp_wg', build='hg38', freq='AF', chr_prefix=None, run_linkage=True, prevalence=0.001, inherit_mode='AD', wild_pen=0.01, muta_pen=0.9, theta_max=0.5, theta_inc=0.05)[0m
[1;40;32mMESSAGE: 18 samples found in FAM file but not in VCF file:[0m
[1;40;32mMESSAGE: 18 samples found in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz][0m
[1;40;32mMESSAGE: separate chromosome to regions[0m
[1;40;32mMESSAGE: 6 families with a total of 18 samples will be scanned for 1 pre-defined units[0m
[1;40;32mMESSAGE: write to pickle: data/test_chp_wg/chr1result/chr1result0.pickle,Gene number:1,Time:0.00011511449753824207[0m
create data/test_chp_wg/chr1result/chr1result0_AFcutoff0.05_linkage.input
create data/test_chp_wg/chr1result/chr1result0_AFcutoff0.05_linkage.lods
0.23506776429712772
create data/test_chp_wg/chr1result/chr1result0_AFcutoff0.05_linkage.besthlod
[1;40;32mMESSAGE: ============= Finish analysis ==============[0m
Input format
--fam
, Fam file (required, format: "fid iid fathid mothid sex trait[1 control, 2 case, -9 or 0 missing]")
%%writefile testdata/test_ped.fam
1033 1033_2 0 0 2 -9
1033 1033_1 0 0 1 -9
1033 1033_99 1033_1 1033_2 2 1
1033 1033_9 1033_1 1033_2 2 1
1033 1033_3 1033_1 1033_2 2 2
1036 1036_99 1036_1 1036_2 2 2
1036 1036_6 0 0 1 2
1036 1036_1 0 0 1 -9
1036 1036_3 1036_6 1036_99 2 1
1036 1036_4 1036_6 1036_99 2 1
1036 1036_2 0 0 2 -9
1036 1036_5 1036_6 1036_99 1 1
10J_103 10J_103_10 0 0 1 -9
10J_103 10J_103_4 0 0 1 -9
10J_103 10J_103_3 0 0 2 -9
10J_103 10J_103_2 10J_103_4 10J_103_3 2 2
10J_103 10J_103_1 10J_103_10 10J_103_3 1 2
10J_109 10J_109_2 10J_109_6 10J_109_5 1 2
10J_109 10J_109_3 10J_109_6 10J_109_5 1 2
10J_109 10J_109_4 10J_109_6 10J_109_5 1 2
10J_109 10J_109_6 0 0 1 -9
10J_109 10J_109_1 10J_109_6 10J_109_5 1 2
10J_109 10J_109_5 0 0 2 2
10J_109 10J_109_7 10J_109_6 10J_109_5 1 1
10J_112 10J_112_3 0 0 1 1
10J_112 10J_112_5 10J_112_3 10J_112_2 1 2
10J_112 10J_112_1 10J_112_3 10J_112_2 2 1
10J_112 10J_112_7 10J_112_3 10J_112_2 1 1
10J_112 10J_112_2 0 0 2 2
10J_119 10J_119_2 0 0 1 1
10J_119 10J_119_5 0 0 2 1
10J_119 10J_119_4 0 0 1 1
10J_119 10J_119_6 10J_119_4 10J_119_5 1 2
10J_119 10J_119_7 10J_119_4 10J_119_5 2 2
10J_119 10J_119_1 10J_119_4 10J_119_5 2 2
10J_119 10J_119_3 10J_119_2 10J_119_1 1 1
Overwriting ../testdata/test_ped.fam
--vcf
, VCF file (required, vcf.gz with vcf.gz.tbi)
bgzip -c file.vcf > file.vcf.gz
tabix -p vcf file.vcf.gz
--anno
, Annotation file fromANNOVAR
, It must contains the allele frequency for population in the file of family population information. For example in here, The annotation file must have AF_amr, AF_afr, AF_nfe columns.
anno=pd.read_csv('testdata/test_chr1_anno.csv')
anno
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
Chr | Start | End | Ref | Alt | Func.refGene | Gene.refGene | GeneDetail.refGene | ExonicFunc.refGene | AAChange.refGene | ... | CLNDISDB | CLNREVSTAT | CLNSIG | DN ID | Patient ID | Phenotype | Platform | Study | Pubmed ID | Otherinfo1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 10140 | 10147 | ACCCTAAC | A | intergenic | NONE;DDX11L1 | dist=NONE;dist=1727 | . | . | ... | . | . | . | . | . | . | . | . | . | chr1:10140:ACCCTAAC:A |
1 | 1 | 10146 | 10147 | AC | A | intergenic | NONE;DDX11L1 | dist=NONE;dist=1727 | . | . | ... | . | . | . | . | . | . | . | . | . | chr1:10146:AC:A |
2 | 1 | 10146 | 10148 | ACC | * | . | . | . | . | . | ... | . | . | . | . | . | . | . | . | . | chr1:10146:ACC:* |
3 | 1 | 10150 | 10151 | CT | C | intergenic | NONE;DDX11L1 | dist=NONE;dist=1723 | . | . | ... | . | . | . | . | . | . | . | . | . | chr1:10150:CT:C |
4 | 1 | 10172 | 10177 | CCCTAA | C | intergenic | NONE;DDX11L1 | dist=NONE;dist=1697 | . | . | ... | . | . | . | . | . | . | . | . | . | chr1:10172:CCCTAA:C |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 1 | 66479 | 66487 | TATTTATAG | * | . | . | . | . | . | ... | . | . | . | . | . | . | . | . | . | chr1:66479:TATTTATAG:* |
996 | 1 | 66480 | 66481 | AT | A | intergenic | FAM138A;OR4F5 | dist=30399;dist=2610 | . | . | ... | . | . | . | . | . | . | . | . | . | chr1:66480:AT:A |
997 | 1 | 66480 | 66483 | ATTT | * | . | . | . | . | . | ... | . | . | . | . | . | . | . | . | . | chr1:66480:ATTT:* |
998 | 1 | 66481 | 66488 | TTTATAGA | T | intergenic | FAM138A;OR4F5 | dist=30400;dist=2603 | . | . | ... | . | . | . | . | . | . | . | . | . | chr1:66481:TTTATAGA:T |
999 | 1 | 66481 | 66488 | TTTATAGA | * | . | . | . | . | . | ... | . | . | . | . | . | . | . | . | . | chr1:66481:TTTATAGA:* |
1000 rows × 152 columns
anno.loc[:,['AF_amr', 'AF_afr', 'AF_nfe']]
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
AF_amr | AF_afr | AF_nfe | |
---|---|---|---|
0 | 0.0006 | 0.0008 | 0.0007 |
1 | 0.6380 | 0.6300 | 0.6413 |
2 | . | . | . |
3 | 0.0357 | 0.0426 | 0.0370 |
4 | 0.0086 | 0.0097 | 0.0084 |
... | ... | ... | ... |
995 | . | . | . |
996 | 0.0039 | 0.0036 | 0.0076 |
997 | . | . | . |
998 | 0.0163 | 0.0410 | 0.0279 |
999 | . | . | . |
1000 rows × 3 columns
Or, create a self-defined annotation file like this:
Chr Start AF_amr AF AF_nfe AF_afr
chr1:10140:ACCCTAAC:A 1 10140 0.0006 0.0007 0.0007 0.0008
chr1:10146:AC:A 1 10146 0.638 0.6328 0.6413 0.63
chr1:10150:CT:C 1 10150 0.0357 0.0375 0.037 0.0426
chr1:10172:CCCTAA:C 1 10172 0.0086 0.0082 0.0084 0.0097
chr1:10178:CCTAA:C 1 10178 0.5 0.3333 0.2955 0.4375
chr1:10198:TAACCC:T 1 10198 0.0 0.0 0.0 0.0
chr1:10231:C:A 1 10231 0.2 0.0366 0.0 0.05
chr1:10236:AACCCT:A 1 10236 0.0 0.0 0.0 0.0
chr1:10247:TAAACCCTA:T 1 10247 0.2222 0.2089 0.1429 0.4211
The index must match with the ID in vcf file.
--pop
, The file of family population information
%%writefile testdata/test_fam_pop.txt
1033 AF_amr
1036 AF_amr
10J_103 AF_afr
10J_109 AF_nfe
10J_112 AF_nfe
10J_119 AF_nfe
Writing ../testdata/test_fam_pop.txt
--included-vars
, The file with one column of variants For example:
chr1:10140:ACCCTAAC:A
chr1:10172:CCCTAA:C
chr1:10198:TAACCC:T
chr1:10236:AACCCT:A
chr1:10261:T:TA
chr1:10262:AACCCT:A
--blueprint
, The blueprint file that defines regional marker (format: "chr startpos endpos name avg.distance male.distance female.distance"). The first four columns are required.
%%writefile testdata/test_blueprint_ext.txt
1 11868 14362 LOC102725121@1 9.177127474362311e-07 1.1657192989882668e-06 6.814189157634088e-07
1 11873 14409 DDX11L1 9.195320788455595e-07 1.1680302941673515e-06 6.82769803434766e-07
1 14361 29370 WASH7P 1.5299877409602128e-06 1.94345806118021e-06 1.136044574393209e-06
1 17368 17436 MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1 1.217692507120495e-06 1.5467668502473368e-06 9.041595098829462e-07
1 30365 30503 MIR1302-10@1,MIR1302-11@1,MIR1302-2@1,MIR1302-9@1 2.1295973889038703e-06 2.705108741548526e-06 1.5812659765416382e-06
1 34610 36081 FAM138A@1,FAM138C@1,FAM138F@1 2.4732411024120156e-06 3.1416201771056266e-06 1.8364278747737466e-06
1 69090 70008 OR4F5 4.866641545668504e-06 6.181823219621424e-06 3.6135725636621673e-06
1 134772 140566 LOC729737 9.633289838108921e-06 1.2236630588823159e-05 7.152898262617822e-06
1 490755 495445 LOC100132062@1,LOC100132287@1 2.2828130832833112e-05 2.8997300893994373e-05 1.6950315013571593e-05
1 450739 451678 OR4F16@1,OR4F29@1,OR4F3@1 2.575942360468604e-05 3.2720758549649544e-05 1.912685483821856e-05
1 627379 629009 LOC101928626 3.943568768003252e-05 5.009295373297854e-05 2.9281737249900675e-05
1 632614 632685 MIR12136 3.974742311959244e-05 5.048893386847169e-05 2.9513206656665908e-05
Overwriting testdata/test_blueprint_ext.txt
Or
%%writefile testdata/test_blueprint.txt
1 11868 14362 LOC102725121@1
1 11873 14409 DDX11L1
1 14361 29370 WASH7P
1 17368 17436 MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1
1 30365 30503 MIR1302-10@1,MIR1302-11@1,MIR1302-2@1,MIR1302-9@1
1 34610 36081 FAM138A@1,FAM138C@1,FAM138F@1
1 69090 70008 OR4F5
1 134772 140566 LOC729737
1 490755 495445 LOC100132062@1,LOC100132287@1
1 450739 451678 OR4F16@1,OR4F29@1,OR4F3@1
1 627379 629009 LOC101928626
1 632614 632685 MIR12136
Overwriting testdata/test_blueprint.txt
Output format
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for SEQLinkage-2.0.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 83712aa6b7be6ade5c5170d999a74247b2ee45439ba4abc5499ad3fa0714ba21 |
|
MD5 | 3fb095311c156204944aa58f093245ca |
|
BLAKE2b-256 | cc409cd89455a1fc7a5afcd2d18acdbfd86a873e937d5629ccad937d54714311 |