Home/dsa/Bit Manipulation/Repeated DNA Sequences

Repeated DNA Sequences

Master this topic with zero to advance depth.

Repeated DNA Sequences

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Medium

Examples

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]
Approach 1

Level I: Brute Force (Hash Set of Strings)

Intuition

Iterate through every possible 10-letter substring and store it in a hash set. If we encounter a substring that is already in the set, we add it to our result list.

Thought Process

  1. Use a seen hash set to track substrings.
  2. Use a repeated hash set to collect result strings.
  3. Loop i from 0 to s.length() - 10:
    • Extract substring s.substring(i, i + 10).
    • If in seen, add to repeated.
    • Else, add to seen.

Pattern: String Rolling Window

O(N \cdot L) - Where $N$ is string length and $L=10$ is the sequence length.💾 O(N \cdot L) - To store strings in sets.
java
import java.util.*;

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Set<String> seen = new HashSet<>(), repeated = new HashSet<>();
        for (int i = 0; i <= s.length() - 10; i++) {
            String sub = s.substring(i, i + 10);
            if (seen.contains(sub)) repeated.add(sub);
            else seen.add(sub);
        }
        return new ArrayList<>(repeated);
    }
}
Approach 2

Level II: Rolling Hash (Rabin-Karp)

Intuition

Instead of re-extracting substrings, we treat each 10-letter window as a number in base 4 (since there are 4 DNA bases). As we slide the window, we update the hash in O(1)O(1) time.

$O(N)$💾 $O(N)$
java
class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        if (s.length() < 10) return new ArrayList<>();
        int[] map = new int[256];
        map['A'] = 0; map['C'] = 1; map['G'] = 2; map['T'] = 3;
        Set<Integer> seen = new HashSet<>(), rep = new HashSet<>();
        int h = 0;
        for (int i = 0; i < 10; i++) h = (h << 2) | map[s.charAt(i)];
        seen.add(h);
        for (int i = 10; i < s.length(); i++) {
            h = ((h << 2) | map[s.charAt(i)]) & 0xFFFFF;
            if (!seen.add(h)) {
                // Logic to track only duplicate sequences efficiently
            }
        }
        return new ArrayList<>();
    }
}
Approach 3

Level III: Optimal (Bitmasking)

Intuition

Since there are only 4 characters, map them to 2 bits. A 10-char sequence is a 20-bit integer. Use a sliding window to update the bitmask in O(1)O(1) per character.

Thought Process

  1. Map A=0, C=1, G=2, T=3.
  2. Maintain a 20-bit sliding window.
  3. Use a hash map/set to track encountered bitmasks.

Pattern: 2-bit Character Encoding

O(N) - Linear scan without string creation.💾 O(min(N, 2^{20})) - Hash set storage for 20-bit integers.
java
public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        if (s.length() < 10) return new ArrayList<>();
        Map<Character, Integer> toInt = Map.of('A', 0, 'C', 1, 'G', 2, 'T', 3);
        Set<Integer> seen = new HashSet<>();
        Set<String> res = new HashSet<>();
        int mask = (1 << 20) - 1, curr = 0;
        for (int i = 0; i < s.length(); i++) {
            curr = ((curr << 2) | toInt.get(s.charAt(i))) & mask;
            if (i >= 9) {
                if (seen.contains(curr)) res.add(s.substring(i - 9, i + 1));
                else seen.add(curr);
            }
        }
        return new ArrayList<>(res);
    }
}