A DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: 'ACGAATTCCG'. Write a function to find all the 10-letter-long sequences(sub-strings) that occur more than once in a DNA molecule. Example - Input: 'AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT', Expected Output: [AAAAACCCCC,CCCCCAAAAA]
Please try solving this problem before jumping on the solution
Click to learn
Subscribe for more updates
Preparing for interviews? IDeserve team is here to help.
Create your profile
Create your profile, and here is what you will get:
1: Interview practice platform.
2: Once you are ready to take the interview, IDeserve team will help you get connected to the best job opportunities.
3: Personalized mentorship from IDeserve team once your interview process has started.
Creation of profile shouldn't take more than 2 minutes.
A naive approach of sliding the 10-letter window across the given sequence combined with hashmap would take O(n^2) time. Using rolling hash method takes O(n) time. Here are the steps. 1. Compute hash value for the sequence in the first window. 2. Store the computed value in a set/hashmap. 3. Compute hash values for subsequent sequence(which would be sequence obtained by sliding 10-letter window to the right by one character) using rolling hash method. Rolling hash method makes use of hash value computed for previous sequence to compute hash value for current sequence. 4. If computed hash value is already present in the hashmap then add the current sequence to output set, else store the computed value in the hashmap. . 5. Repeat step #3, #4 until all the 10-letter sequences are completed.
Rolling Hash computation uses following method. For details and intuition, please check out the video. currHash = prevHash - val(skippedChar)*2^10 currHash = currHash * 2 currHash = currHash + 2*val(newChar)
Support us by whitelisting IDeserve in your ad-blocker.