Repeated DNA Sequences

Master this topic with zero to advance depth.

Repeated DNA Sequences

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Medium

Examples

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Approach 1

Level I: Brute Force (Hash Set of Strings)

Intuition

Iterate through every possible 10-letter substring and store it in a hash set. If we encounter a substring that is already in the set, we add it to our result list.

Thought Process

Use a seen hash set to track substrings.
Use a repeated hash set to collect result strings.
Loop i from 0 to s.length() - 10:
- Extract substring s.substring(i, i + 10).
- If in seen, add to repeated.
- Else, add to seen.

Pattern: String Rolling Window

⏱ O(N \cdot L) - Where $N$ is string length and $L=10$ is the sequence length.💾 O(N \cdot L) - To store strings in sets.

java

import java.util.*;

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Set<String> seen = new HashSet<>(), repeated = new HashSet<>();
        for (int i = 0; i <= s.length() - 10; i++) {
            String sub = s.substring(i, i + 10);
            if (seen.contains(sub)) repeated.add(sub);
            else seen.add(sub);
        }
        return new ArrayList<>(repeated);
    }
}

Approach 2

Level II: Rolling Hash (Rabin-Karp)

Intuition

Instead of re-extracting substrings, we treat each 10-letter window as a number in base 4 (since there are 4 DNA bases). As we slide the window, we update the hash in $O(1)$ time.

⏱ $O(N)$💾 $O(N)$

java

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        if (s.length() < 10) return new ArrayList<>();
        int[] map = new int[256];
        map['A'] = 0; map['C'] = 1; map['G'] = 2; map['T'] = 3;
        Set<Integer> seen = new HashSet<>(), rep = new HashSet<>();
        int h = 0;
        for (int i = 0; i < 10; i++) h = (h << 2) | map[s.charAt(i)];
        seen.add(h);
        for (int i = 10; i < s.length(); i++) {
            h = ((h << 2) | map[s.charAt(i)]) & 0xFFFFF;
            if (!seen.add(h)) {
                // Logic to track only duplicate sequences efficiently
            }
        }
        return new ArrayList<>();
    }
}

Approach 3

Level III: Optimal (Bitmasking)

Intuition

Since there are only 4 characters, map them to 2 bits. A 10-char sequence is a 20-bit integer. Use a sliding window to update the bitmask in $O(1)$ per character.

Thought Process

Map A=0, C=1, G=2, T=3.
Maintain a 20-bit sliding window.
Use a hash map/set to track encountered bitmasks.

Pattern: 2-bit Character Encoding

⏱ O(N) - Linear scan without string creation.💾 O(min(N, 2^{20})) - Hash set storage for 20-bit integers.

java

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        if (s.length() < 10) return new ArrayList<>();
        Map<Character, Integer> toInt = Map.of('A', 0, 'C', 1, 'G', 2, 'T', 3);
        Set<Integer> seen = new HashSet<>();
        Set<String> res = new HashSet<>();
        int mask = (1 << 20) - 1, curr = 0;
        for (int i = 0; i < s.length(); i++) {
            curr = ((curr << 2) | toInt.get(s.charAt(i))) & mask;
            if (i >= 9) {
                if (seen.contains(curr)) res.add(s.substring(i - 9, i + 1));
                else seen.add(curr);
            }
        }
        return new ArrayList<>(res);
    }
}

← Previous QuestionGray Code

Next Question →Maximum Product of Word Lengths

Found an issue or have a suggestion?

Help us improve! Report bugs or suggest new features on our Telegram group.