Algorithm Write Up: Longest Unique Substring

So, you’re thinking of brushing up on your data structures and algorithms skills. Maybe you’re looking for a new job. Maybe you’ve been learning a new language and want to challenge yourself to apply it. Maybe you just like the challenge. No matter the reason, you sit down at your computer, go over to Leetcode, choose a fun-looking problem and get to work.

Practice makes perfect — Photo by Émile Perron on Unsplash

Unfortunately, you chose one that’s a little over your head, and your brute-force O(n!) solution isn’t accepted. That’s fine, let’s take a minute to think about this. A minute turns into two, turns into ten, and eventually you decide to look at the answer. Oh, you use a hash map to solve it, that makes sense. And you file that away for next week when you want to come back to the problem and solve it the right way. Then, next week comes and you don’t quite remember the trick you learned. So after a while trying to remember the key to that nice, optimized answer, you go back to the solution page to remind yourself, and the cycle repeats.

That’s a position I’ve found myself in more than once. So, I’ve decided to do something about it. Instead of struggling through them again and again, I’m going to write about the algorithms that give me trouble. The idea is to cement the solutions in my mind by writing them, and maybe even help some other poor sap in my shoes to wrap their head around something that’s giving them trouble. I’m using JavaScript because that’s the language I’m the most comfortable with, but there’s not much here that wouldn’t be applicable to another language with some minor translation. So, without further ado, here’s my write up of the Leetcode problem Longest Substring without Repeating Characters.

First of all, what is Longest Substring without Repeating Characters asking for? Well, to quote the problem description:

In other words, we’re looking for the longest portion of a string in which every character is unique. e.g. if we input “abcabcbb”, we should get back 3, which we get from “abc” (or “bca” or “cab”).

Now, we could always do this the brute force way: loop through each letter, and for each one of those letters, loop through the rest of the subsequent letters, checking if each new letter in the substring is unique. That‘s exponential time, so not great.

Here’s the code for that solution. Let’s not spend too much time on it

Now, for the optimal strategy: We’re going to get that exponential time down to linear time using a sliding window. To make that possible, we need to set up three things. First, a pointer — here meaning a variable storing an index in the string — which will keep track of the start of the current substring. Second a set of key-value pairs to keep track of the last index at which we saw any given letter. This article will be referring this this as the “seen object” because we’ll be coding out the problem in JavaScript, where that type of data structure is called an object. If you’re more familiar with Python, think of it as a dictionary; if Ruby is your thing, it’s a hash. Finally, we’ll keep track of the maximum length, which is what we want to return.

With that setup done, we can get to the meat of the algorithm. We’re going to use a for loop to go through each index of the string. If we’ve seen the letter at that index before (i.e. if it’s in our seen object) we’ll update the starting pointer to be the index after the last occurrence of that letter (more on this later). If we see an “a” at index 0 we’ll store { a: 1 }. If we see another “a” at index 1, we’ll update that to { a: 2 }.

Let’s take a look at how that seen object keeps our string unique. If we look at the string “abcabcbb” again, we’ll go through that first substring “abc” before encountering a duplicate in “abca.” How do we make “abca” unique again? There are two a’s we could remove. Removing the last one doesn’t help us much. In fact, that quite literally moves us backwards. Instead we should remove the first “a”, and we do that by changing the start pointer to the index after that first “a”, AKA the value stored in the seen object. That gives us the string “bca” which is unique. Let’s visualize that. I’m going to use carets to represent the starting (S) and ending (E) pointers:

   abcabcbb
S: ^
E: ^
Substring: a
abcabcbb
S: ^
E: ^
Substring: ab
abcabcbb
S: ^
E: ^
Substring: abc
abcabcbb
S: ^
E: ^
Substring: abca

That’s a repeated “a”. The last time we saw an “a” was at index 0, so let’s move the start to index 1:

   abcabcbb
S: ^
E: ^
Substring: bca

Now, in that case we only needed to move the start pointer up one. If we had a string like “abcdcba” we would get all the way to “abcdc” before hitting a duplicated “c”, and this strategy would have us move the start pointer up to the character after the first “c” at index 2: the “d” at index 3, giving us the unique substring “dc”. Let’s visualize that too:

   abcdcba
S: ^
E: ^
Substring: a
abcdcba
S: ^
E: ^
Substring: ab
abcdcba
S: ^
E: ^
Substring: abc
abcdcba
S: ^
E: ^
Substring: abcd
abcdcba
S: ^
E: ^
Substring: abcdc

Ah, my old friend “c”, we meet again. Last time we saw you was at index 2. That means we need to move the starting pointer to index 3 and get the substring “dc”.

   abcdcba
S: ^
E: ^
Substring: dc

That explanation under our belt, let’s check out the next step. As you iterate through the string and encounter duplicates, you don’t want to keep the original occurrence index stored. In the last example, after we move the starting pointer to “d” at index 3, index 2 is no longer the last time we saw “c”. Now, the last time we saw “c” is index 4. To keep track of that, we reassign the value of “c” in the seen object to 5 (index 4 plus 1).

We also want to be updating the max length of a unique substring as we go. However, we should be careful to only do that after we check its uniqueness and move the start pointer. If our longest substring was “abc”, we don’t want to say we saw a unique substring that was 4 characters long because we hit “abca”. We want to move the start pointer and then compare the length of “abc” to that of “bca”. That way, when we finish the for loop, we’ll have the maximum unique substring’s length in that max variable and we can return it.

Hopefully, that made some sense. If it didn’t, we’re about to code it out and go through our solution step by step, so if you’re they type that needs to see the code to understand the algorithm, just sit tight. That said, if you want to try your hand at coding it out yourself from the pseudocode below, you may want to do that now.

Let’s write that out with some comments

First step is setting up the start pointer, object, and max. There’s not too much challenge there, so let’s just get that written out. start is zero because that’s the first index, seen is an empty array because we haven’t seen any letters yet, and max is zero because that’s the shortest length there is.

Easy Peasy

Now things are going to start getting interesting. We’re going to set up a for loop which will take us through each number from zero to string.length — 1. In other words: each index in string. To make things easy on ourselves, let’s also store the current character in a variable: we’ll be needing it.

Here’s where things start to get a bit loopy

And finally we do the fun part. We’ll write an if statement to check if we’ve seen the letter before (i.e. if that letter is a key in seen). If we have seen the letter before, we need to move the starting pointer to the last time we saw that letter, unless that is less than the current starting pointer, in which case we won’t move the starting pointer. We’ll explore this more later.

Then, no matter if we’ve seen the character or not, we want to recalculate max. We’ll assign max to the larger of the length of the current (unique) substring or the pre-existing value of max. We also want to assign the current letter’s value in the seen object.

From the start, we’ve been talking about setting the value of the seen object to the current index plus one, and here’s where we learn why. First of all, when we move start, we want to move it to the index after the last occurrence of a character (i.e. that character’s last index plus one) and this is as good a place to do that as any. Second, we want to be able to detect an occurrence at index 0. If we have seen = { a: 0 } and try to base an if statement on seen['a'], it will be skipped over because 0 is falsey.

Anyway, let’s get that written out:

Now, write the rest of the algorithm

That is the main substance of this algorithm, but before we run through it with an example, I want to take some more time to dig into line 13: start = Math.max(seen[end], start); because that line took me a while to understand. Why do we do that? Why not just set start equal to seen[end] and call it a day? In short, it’s to prevent us from moving the start pointer backwards. For a more depth explanation, let’s take a look at the string “abba”.

We see our first duplicate at “abb”, so we move start to the index after the last time we saw “b”, which gives us the substring “b”. We then iterate once more and get “ba”, which is unique. However, we still have that first occurrence of “a” at index 1 stored in seen, so if we automatically send start to the last occurrence of “a”, we send the starting pointer backwards, and end up returning 4 because sending the pointer backwards gives us the string “abba”. Not what we want. However, if we check that the last occurrence is in our current substring (i.e. is greater than start) before moving the starting pointer, then we avoid that whole issue. Let’s visualize that:

   abba
S: ^
E: ^
Seen: {}
abba
S: ^
E: ^
Seen: {a: 1}
abba
S: ^
E: ^
Seen: {a: 1, b:2}
This isn't a full 'step' of the loop, but it's helpful to show the start moving: abba
S: ^
E: ^
Seen: {a: 1, b:3}
abba
S: ^
E: ^
Seen: {a: 1, b:3}

Without the check described above, we end up here:

   abba
S: ^
E: ^
Seen: {a: 4, b:3}

But with it, we end up with this:

   abba
S: ^
E: ^
Seen: {a: 4, b:3}

That’s it! We have our code. Time to run through it. Let’s start with Leetcode’s very own test “abcabcbb”. We’ll start with both the starting pointer and ending pointer at 0 and an empty seen object, and build them up as we step through the function. The first segments will look familiar from when we stepped through the algorithm state earlier, although these are strictly what the state looks like at the end of each step of the for loop:

Str:     abcabcbb
Start: 0 ^
End: 0 ^
Seen: { a: 1 }
Max: 1
Str: abcabcbb
Start: 0 ^
End: 1 ^
Seen: { a: 1, b: 2 }
Max: 2
Str: abcabcbb
Start: 0 ^
End: 2 ^
Seen: { a: 1, b: 2, c: 3 }
Max: 3

At this point, we hit our first duplicate, so we move up the start pointer and reassign the value for “a”.

Str:     abcabcbb
Start: 1 ^
End: 3 ^
Seen: { a: 4, b: 2, c: 3 }
Max: 3

Once again, we hit a duplicate, so we move start and reassign “b”, and so on and so forth:

Str:     abcabcbb
Start: 2 ^
End: 4 ^
Seen: { a: 4, b: 5, c: 3 }
Max: 3
Str: abcabcbb
Start: 3 ^
End: 5 ^
Seen: { a: 4, b: 5, c: 6 }
Max: 3
Str: abcabcbb
Start: 5 ^
End: 6 ^
Seen: { a: 4, b: 7, c: 6 }
Max: 3
Str: abcabcbb
Start: 7 ^
End: 7 ^
Seen: { a: 4, b: 8, c: 6 }
Max: 3

As soon as we hit end === 7, we’ve hit the end of the loop and we can return 3, and that’s how we find the longest substring without repeating characters.

Interested in civic tech, data visualization, and data science. Published in The Startup. Sometimes I have thoughts.