Abstract

An algorithm for identifying haplotype heterogeneity in cancer genomes is proposed that accounts for somatic mutational events carried by multiple sub-clones. The algorithm operates on genomic sequencing data from multiple libraries of tumor tissue and extracts features from both the multi-library design and the constraints of paired-end reads. The number of sub-clones is first roughly estimated a priori by clustering the variant allele frequencies of the somatic loci. A contig-and-extension algorithm is then designed, and the haplotype sequences are assembled by traversing the reads that map to these loci, so the resulting contigs provide identification at base-pair resolution. The number and proportions of the sub-clones, and the evolutionary relationships among them, are further estimated by maximizing the posterior probability. Simulation results show that the algorithm reaches 99% accuracy when the sequencing libraries meet a certain coverage requirement. The proposed algorithm outperforms the existing two-stage pipeline that is currently in wide use for data analysis.

  • Affiliations
    Xi'an Jiaotong University; Liaoning Medical University

Full text