Repeated DNA Sequences
Master this topic with zero to advance depth.
Repeated DNA Sequences
Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Examples
Level I: Brute Force (Hash Set of Strings)
Intuition
Iterate through every possible 10-letter substring and store it in a hash set. If we encounter a substring that is already in the set, we add it to our result list.
Thought Process
- Use a
seenhash set to track substrings. - Use a
repeatedhash set to collect result strings. - Loop
ifrom 0 tos.length() - 10:- Extract substring
s.substring(i, i + 10). - If in
seen, add torepeated. - Else, add to
seen.
- Extract substring
Pattern: String Rolling Window
Level II: Rolling Hash (Rabin-Karp)
Intuition
Instead of re-extracting substrings, we treat each 10-letter window as a number in base 4 (since there are 4 DNA bases). As we slide the window, we update the hash in time.
Level III: Optimal (Bitmasking)
Intuition
Since there are only 4 characters, map them to 2 bits. A 10-char sequence is a 20-bit integer. Use a sliding window to update the bitmask in per character.
Thought Process
- Map A=0, C=1, G=2, T=3.
- Maintain a 20-bit sliding window.
- Use a hash map/set to track encountered bitmasks.
Pattern: 2-bit Character Encoding
Found an issue or have a suggestion?
Help us improve! Report bugs or suggest new features on our Telegram group.