Abstract

This project aims to develop a novel algorithm for finding sequence motifs that is faster than existing algorithms without compromising accuracy. Motifs are short, recurring sequence patterns, such as recurring runs of nitrogenous bases in DNA, that represent fundamental units of biological function. They mediate interactions among proteins, DNA, and RNA, including those that regulate gene expression. Finding motifs allows biologists to predict the biological function of particular regions of the genome; for example, the NANOG transcription regulator motif has been linked to the pluripotency of embryonic stem cells. The new algorithm is centered on comparing a biological sequence with its randomly shuffled counterpart to identify significant motifs. The proposed algorithm proved to be very fast and quite accurate when tested on previously gathered sequences with known motifs: it correctly identified the prominent motifs in sequences containing the NANOG, STAT1, RUNX3, and c-Jun motifs. When a PITX1 sequence was analyzed, the “E-box” motif was found to occur very frequently. Further research indicated that proteins bound at the PITX1 motif engage in protein-protein interactions with proteins bound at the E-box motif. Interestingly, the two motifs did not co-occur in the same sequence; rather, the DNA strand itself bends to allow these interactions. In conclusion, a novel algorithm for accurately and quickly finding motifs in DNA sequences was successfully created. It has several potential applications, such as identifying mutations in the epsilon4 allele of the apolipoprotein E gene (APOE) that are associated with an increased risk of Alzheimer’s disease. Identifying which parts of a DNA or RNA sequence contribute to such diseases can lead to better diagnosis and treatment, and potentially to a cure.
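
The core idea of comparing a sequence with its randomly shuffled counterpart can be illustrated with a minimal Python sketch. This is not the project's actual implementation: the abstract does not say how candidate motifs are enumerated or scored, so the fixed-length k-mer counting, the simple base-level shuffle, the smoothed enrichment score, and the names `count_kmers` and `enriched_kmers` are all assumptions made for illustration.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def count_kmers(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def enriched_kmers(seq, k=6, n_shuffles=100, seed=0):
    """Rank k-mers by how much more often they occur in seq than in shuffles.

    Returns a list of (kmer, observed_count, score), highest score first.
    Hypothetical helper: the scoring rule below is an assumption, not the
    method described in the abstract.
    """
    rng = random.Random(seed)
    observed = count_kmers(seq, k)

    # Background model: k-mer counts in shuffled copies of the same sequence.
    # Shuffling preserves base composition but destroys the original order,
    # so patterns that survive only in the real sequence stand out.
    letters = list(seq)
    shuffled_counts = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_counts.append(count_kmers("".join(letters), k))

    ranked = []
    for kmer, obs in observed.items():
        background = [counts[kmer] for counts in shuffled_counts]
        mu = mean(background)
        sigma = pstdev(background)
        # Assumed score: a z-score-like ratio, with +1 in the denominator so
        # k-mers that never appear in any shuffle do not divide by zero.
        score = (obs - mu) / (sigma + 1.0)
        ranked.append((kmer, obs, score))

    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked
```

Calling `enriched_kmers(sequence, k=7)` on a DNA string returns the candidate 7-mers ordered by this score, so the most over-represented patterns relative to the shuffled background appear first. A real tool would likely need a shuffle that preserves dinucleotide frequencies and a way to merge overlapping or degenerate k-mers into a single motif; both are left out of this sketch.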