Abstract

Helitrons, eukaryotic transposable elements (TEs) that transpose by a rolling-circle mechanism, have been found in various species with highly variable copy numbers, sometimes making up a large portion of the genome. A major impact of helitron sequences on the genome is that they frequently capture host genes during transposition. Since their discovery 18 years ago through computational analysis of the whole-genome sequences of the plant Arabidopsis thaliana and the nematode Caenorhabditis elegans (C. elegans), the identification and classification of these mobile genetic elements has remained a challenge because the vast majority of their families are non-autonomous. In the C. elegans genome, helitron DNA sequences vary greatly in length, from 11 to 8965 base pairs (bp). In this work, we develop a new method to predict helitron DNA sequences, based in particular on Frequency Chaos Game Representation (FCGR) DNA images. We introduce an automatic system to classify helitron families in the C. elegans genome that combines machine learning approaches with features extracted from DNA sequences. Two sets of helitron features, FCGR images and K-mers, are extracted from the DNA sequences. These features consist of the number of occurrences of words of K nucleotides (tandem repeats) in the DNA sequences. Three different classifiers are used to classify all existing helitron families. Using FCGR images as features and a pre-trained neural network as the classifier, the results show a global score of 72.7%. The two other classifiers, using K-mer features of the genomic sequences, reach 68.7% for the Support Vector Machine (SVM) and 91.45% for the Random Forest (RF) algorithm.
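To make the two feature types concrete, the following minimal Python sketch counts K-mers in a DNA sequence and arranges those counts into an FCGR matrix of size 2^K x 2^K. It is an illustration under stated assumptions, not the implementation used in this work: the function names `kmer_counts` and `fcgr` are hypothetical, the corner assignment (A, C, G, T) is one common convention among several, and words containing ambiguous bases such as N are simply skipped.

```python
from collections import Counter

# Assumed corner layout of the chaos game square, given as (x_bit, y_bit).
# Other works assign the four nucleotides to corners differently.
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_counts(seq, k):
    """Count overlapping words of length k, skipping words with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(base in CORNER_BITS for base in word):
            counts[word] += 1
    return counts


def fcgr(seq, k):
    """Return a 2^k x 2^k grid (list of rows) holding the k-mer counts."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for word, n in kmer_counts(seq, k).items():
        x = y = 0
        # In the chaos game, the last base of a word selects the coarsest
        # quadrant, so the word is read from its last base to its first.
        for base in reversed(word):
            bx, by = CORNER_BITS[base]
            x = (x << 1) | bx
            y = (y << 1) | by
        grid[y][x] += n
    return grid


if __name__ == "__main__":
    example = "ACGTACGT"  # toy sequence, not taken from the C. elegans data
    print(kmer_counts(example, 2))
    for row in fcgr(example, 2):
        print(row)
```

In a pipeline of the kind described above, the grid would typically be normalized and rendered as a grayscale image for a neural network, while the flat K-mer count vector would serve as input to classifiers such as SVM or RF; both steps are left out of this sketch.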