Question

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

  • For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA, it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

Ideas:

  • substrings - > Sliding Window O(n) solution
  • start at beginning of array [0:11] to get first 10 letters, move towards end
  • DS:
    • Hashmap: sorted string → list of variations
  • On update:
  • while r < len(s)
    • Check if window is in hashmap and that its value is the empty list:
    • if true add to res
    • if not add key to map and initialize to []
    • Probs some idomatic python way to do both operations at once
    • increment l and r pointers
  • Hashmap will now only contain a single occurance of subsequences have have appeared at least twice
  • We could probs just use a hashset here but when I read the question I didn’t fully understand what they were asking
    • We could use 2 hash sets, one for all subsequences and one for our result and basically just append to the result if we are in all patterns already, otherwise add to all patterns
  • Edge cases:
    • Not 10 chars long
    • Char is not in sequence (we would actually solve this as well)

Solution 1 Solution 2