You are given a data set of DNA sequences, all of the same length (the file is available here). Each DNA sequence is a string of characters from the alphabet ‘A’, ‘C’, ‘T’, ‘G’ and represents a particular viral strain sampled from an infected individual. Your goal is to write code that helps identify transmission clusters corresponding to outbreaks.
Treat each sequence as a feature vector and each character as a feature. The data set is stored as a FASTA file, which is essentially a text file of the following form:
>Name of Sequence1
AAGCACAGGATGTAATGGTGGGGCCGACCGCCTATTATTCTGATGATTACTTGAGGCCCTCGGAGAGGAAGGGG
>Name of Sequence2
AAGCACAGGATGTAATGGTGGGGCCGACCGCCTATTATTCTGATGATTACTTGAGGCCCTCGGAGAGGAAGGGG
>Name of Sequence3
AAGCACAGGATGTAATGGTGGGGCCGACCGCCTATTATTCTGATGATTACTTGAGGCCCTCGGAGAGGAAGGGG
…..
Here, each line starting with the ‘>’ symbol contains the name of a sequence, and the next line contains the sequence itself.
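The format described above can be parsed with a few lines of plain Python. The following is a minimal sketch, not a reference solution: the function names `read_fasta` and `read_fasta_file` are hypothetical, and the code assumes that a sequence may be wrapped over several lines, which the example above does not show but which FASTA files commonly allow.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Sequence data; upper-cased so 'a' and 'A' compare equal later.
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def read_fasta_file(path):
    """Read a FASTA file from disk (path is supplied by the caller)."""
    with open(path) as handle:
        return read_fasta(handle)
```

Because `read_fasta` accepts any iterable of lines, it can be tried on a small in-memory list before pointing `read_fasta_file` at the real data set.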
You may proceed as follows:
1) Read sequences from the file.
2) Calculate pairwise distances between sequences. Use the Hamming distance, i.e. the number of positions at which two sequences differ (see https://en.wikipedia.org/wiki/Hamming_distance).
3) Project the sequences into 2-D space using Multidimensional Scaling (MDS) based on the Hamming distance matrix.
4) Plot the obtained 2-D data points. Estimate the number of clusters K by visual inspection.
5) Use the k-means algorithm to cluster the 2-D data points.
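Step 2 of the list above can be sketched in plain Python as follows. The names `hamming` and `hamming_matrix` are hypothetical, and the quadratic loop is written for clarity rather than speed. For a large data set, a vectorized version would likely be preferable.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_matrix(seqs):
    """Symmetric matrix (list of lists) of pairwise Hamming distances."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The distance is symmetric, so fill both triangles at once.
            dist[i][j] = dist[j][i] = hamming(seqs[i], seqs[j])
    return dist
```

For example, `hamming_matrix(["AAGC", "AAGT", "TTGT"])` gives `[[0, 1, 3], [1, 0, 2], [3, 2, 0]]`.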
You may use library functions to read data from the file and to perform MDS. For multidimensional scaling in Python, see e.g. https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
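One possible way to use the scikit-learn class linked above on a precomputed distance matrix is sketched below. The helper names `embed_2d` and `plot_points`, the seed, and the output file name are assumptions made for illustration. The `dissimilarity="precomputed"` argument tells `MDS` that its input is already a distance matrix. Parameter names have changed between scikit-learn releases, so check the documentation for the installed version.

```python
import numpy as np
from sklearn.manifold import MDS


def embed_2d(dist, seed=0):
    """Project items into 2-D from a square matrix of pairwise distances."""
    dist = np.asarray(dist, dtype=float)
    # "precomputed" means dist holds distances, not feature vectors.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(dist)  # array of shape (n_items, 2)


def plot_points(coords, labels=None, path="mds.png"):
    """Save a scatter plot; points are colored by label when labels are given."""
    import matplotlib.pyplot as plt  # lazy import: embedding works without it

    coords = np.asarray(coords, dtype=float)
    fig, ax = plt.subplots()
    ax.scatter(coords[:, 0], coords[:, 1], c=labels, s=15)
    ax.set_xlabel("MDS dimension 1")
    ax.set_ylabel("MDS dimension 2")
    fig.savefig(path)
    plt.close(fig)
```

Calling `plot_points(coords)` without labels produces the plot used to estimate K by eye. Passing the cluster labels afterwards colors each cluster differently.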
K-means clustering should be implemented from scratch.
Your submission should contain:
1) The code of your script.
2) Visualization plots for MDS, with different clusters highlighted in different colors.
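Since k-means must be written from scratch, here is a minimal sketch of Lloyd's algorithm using only the standard library. The function names, the random initialization from the data points, the fixed seed, and the `max_iter` cap are all assumptions of this sketch, not requirements of the assignment. Other initializations (for example k-means++) are equally valid.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centroids)."""
    pts = [tuple(float(v) for v in p) for p in points]
    if not 1 <= k <= len(pts):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(pts, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in pts
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(pts, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids
```

The function accepts any sequence of coordinate pairs, including the rows of a 2-D array produced by MDS. The result depends on the initial centroids, so it may be worth running it with several seeds and comparing the clusterings.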
Please do not hesitate to ask questions.