Stretch Profile: A pruning technique to accelerate DNA sequence search
Abstract
DNA sequence similarity search is widely used by scientists to facilitate biological research. As more sequences are added over the years, sequence databases grow steadily larger. Existing techniques usually focus on making the search algorithm itself faster and ignore the possibility of removing unrelated sequences from the search beforehand. This paper presents a pruning technique that accelerates DNA sequence search using a novel Stretch Profile built from stretches of consecutive identical base characters: A-Stretch, C-Stretch, G-Stretch, and T-Stretch. The Stretch Profile is pre-generated for each sequence in a sequence database. During the search, the Stretch Profile of the query sequence is generated and compared against the stored profiles. Database sequences whose profiles do not match the Stretch Profile of the query sequence are excluded from the search, which reduces the search space and, consequently, the search time. For evaluation, we compare the sequences retrieved from the Greengenes database and the processing time when using BLAST alone and when using the proposed pruning technique with BLAST. The results show that the proposed pruning technique reduces the search time by 30.43% to 63.74%, depending on the length of the input query, while maintaining a sensitivity of 1.00 relative to the result of the original BLAST search.
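The pruning idea can be illustrated with a minimal sketch. The abstract does not specify how a Stretch Profile is encoded or what counts as a profile match, so the sketch below makes its own assumptions: the profile is taken to be the longest run of each base (A, C, G, T), and a database sequence is kept when its longest run of every base is at least as long as the query's, optionally relaxed by a `slack` parameter. The function names `stretch_profile`, `may_match`, and `prune`, as well as the toy sequences, are hypothetical and are not taken from the paper.

```python
from itertools import groupby

BASES = "ACGT"


def stretch_profile(seq):
    """Return the longest run of each base in seq.

    Assumption: this max-run-length encoding stands in for the paper's
    Stretch Profile, whose exact definition is not given in the abstract.
    """
    profile = {base: 0 for base in BASES}
    for base, run in groupby(seq.upper()):
        if base in profile:  # skip ambiguity codes such as N
            profile[base] = max(profile[base], sum(1 for _ in run))
    return profile


def may_match(query_profile, db_profile, slack=0):
    """Assumed match rule: every base's longest run in the database
    sequence must reach the query's longest run, minus an optional slack."""
    return all(db_profile[b] + slack >= query_profile[b] for b in BASES)


def prune(query, db_profiles, slack=0):
    """Return the IDs of database sequences that survive pruning."""
    query_profile = stretch_profile(query)
    return [
        seq_id
        for seq_id, profile in db_profiles.items()
        if may_match(query_profile, profile, slack)
    ]


# Toy database; profiles are computed once, ahead of any query.
database = {
    "seq1": "ACGTTTTACGGA",
    "seq2": "ACGTACGTACGT",
    "seq3": "GGGGTTTTAACC",
}
db_profiles = {seq_id: stretch_profile(s) for seq_id, s in database.items()}

# Only the surviving candidates would be handed to the aligner (e.g. BLAST).
candidates = prune("GTTTTAC", db_profiles)
print(candidates)
```

Under this assumed rule, a database sequence that contains the query as an exact substring is never pruned, because every run in the query lies inside a run at least as long in that sequence. Real similarity search tolerates mismatches and gaps, which is why the sketch exposes `slack`; the criterion that lets the paper retain a sensitivity of 1.00 may differ.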
