A crucial question was whether to count each motif once per sequence or to count every occurrence, since a motif such as “GAGCTG” may appear more than once in a single sequence. After testing both strategies, counting only one occurrence per sequence was found to give more accurate motif discovery. This is thought to be due to compositional bias: for example, the string “AAAAAAAA” (length 8) contains three overlapping occurrences of the 6-mer “AAAAAA”. Counting every occurrence therefore caused many low-complexity motifs, such as “GGGGGG”, to be unfairly ranked higher.
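To make the difference between the two counting strategies concrete, here is a minimal Python sketch. The function names (`kmers`, `count_all_occurrences`, `count_once_per_sequence`) and the toy sequences are illustrative assumptions, not the project's actual code.

```python
from collections import Counter


def kmers(seq, k=6):
    """Yield every overlapping k-mer in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_all_occurrences(sequences, k=6):
    """Count every occurrence of every k-mer, including overlapping repeats."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmers(seq, k))
    return counts


def count_once_per_sequence(sequences, k=6):
    """Count each k-mer at most once per sequence."""
    counts = Counter()
    for seq in sequences:
        # A set removes repeats within a single sequence.
        counts.update(set(kmers(seq, k)))
    return counts


# Hypothetical toy input: one low-complexity sequence and two
# sequences containing the motif GAGCTG (twice in the last one).
sequences = ["AAAAAAAA", "TTGAGCTGTT", "GAGCTGAAGAGCTG"]

all_counts = count_all_occurrences(sequences)
once_counts = count_once_per_sequence(sequences)

print(all_counts["AAAAAA"], once_counts["AAAAAA"])  # 3 1
print(all_counts["GAGCTG"], once_counts["GAGCTG"])  # 3 2
```

With every occurrence counted, the single poly-A sequence makes “AAAAAA” look as frequent as “GAGCTG”, even though “GAGCTG” is present in two sequences. Counting once per sequence removes that inflation.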
Initially, the reverse complements of the DNA sequences were not taken into account. Once the reverse complements were included in the algorithm, the desired motifs became even more prominent.
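The text does not say exactly how the reverse complements were combined, so the following Python sketch shows only one common approach, stated here as an assumption: each k-mer and its reverse complement are collapsed into a single "canonical" form (the alphabetically smaller of the two), and that form is counted once per sequence. The helper names are hypothetical, and the sketch assumes sequences contain only the uppercase letters A, C, G, and T.

```python
from collections import Counter

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string of A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_once_both_strands(sequences, k=6):
    """Count each canonical k-mer at most once per sequence."""
    counts = Counter()
    for seq in sequences:
        seen = {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts


# Hypothetical toy input: the motif GAGCTG appears on the forward strand
# of the first sequence and as its reverse complement in the second.
sequences = ["TTGAGCTGTT", "AACAGCTCAA"]
counts = count_once_both_strands(sequences)

print(reverse_complement("GAGCTG"))  # CAGCTC
print(counts[canonical("GAGCTG")])   # 2
```

Under this scheme a motif present on either strand contributes to the same count, which is one plausible reason the desired motifs would stand out more once reverse complements are included.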