Implement strStr()
Implement strStr(). Return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.
Approach 1: Brute Force
When tackling problems of this sort, an obvious method for doing a substring search would be to check, for each possible position in the text at which the pattern could match, whether it does indeed match.
We keep one pointer (i) on the text and another pointer (j) into the pattern, where m is the length of the text and n is the length of the pattern. For each i, j resets to 0 (starting fresh to look for the pattern) and is incremented until it hits a mismatch or the whole pattern has matched (j == n). If i reaches m - n + 1 without a match, we return -1 to indicate that the pattern does not occur in the text.
The loop terminates at i == m - n + 1 rather than i == m because, from that point on, only n - k characters remain in the text (with k > 0), so a pattern of length n cannot possibly fit.
Time: O(m * n) Space: O(1)
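The brute-force approach described above can be sketched in Java roughly as follows. The class name BruteForceSearch is an assumption made for illustration, and returning 0 for an empty needle is a common convention rather than something stated above.

```java
final class BruteForceSearch {
    // Returns the index of the first occurrence of needle in haystack, or -1.
    // Assumption: an empty needle matches at index 0.
    static int strStr(String haystack, String needle) {
        int m = haystack.length(); // length of the text
        int n = needle.length();   // length of the pattern

        // i only runs up to m - n: past that point fewer than n characters remain.
        for (int i = 0; i <= m - n; i++) {
            int j = 0; // j starts fresh for every candidate position i
            while (j < n && haystack.charAt(i + j) == needle.charAt(j)) {
                j++;
            }
            if (j == n) {
                return i; // every character of the pattern matched
            }
        }
        return -1; // no position matched
    }
}
```

For example, BruteForceSearch.strStr("hello", "ll") returns 2, and BruteForceSearch.strStr("aaaaa", "bba") returns -1.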
Real-life applications:
Plagiarism Detection: I'm sure we all remember that, back in college, some professors wanted our essays emailed as well as printed out. It so happens that those essays were being broken down into string tokens and compared using string matching algorithms. (They probably used some of the more optimized versions mentioned below.)
Honorable mentions: Bioinformatics and DNA Sequencing, Spam Filters, Spell Checkers
Why the Improvements?
Let's say we're dealing with an immense body of text that contains long runs of A's. If the pattern we want to find also begins with a long run of A's, then brute force will be slow: at almost every position it matches most of the pattern before hitting a mismatch. In that scenario, we would probably want a faster algorithm. Thankfully, we do not have to start the research ourselves: over the years, brilliant individuals have come up with different ideas that we can tap into to optimize our implementation. The main variations are KMP, Boyer-Moore, Rabin-Karp and the Z-algorithm.
Each of these algorithms has its own strengths and weaknesses depending on the nature of the data, but on inputs like the one above they will generally be faster than brute force.
Frankly, KMP is the one I most often remember; however, I'd consider Rabin-Karp the most pragmatic, as you can tune it to your needs or extract its underlying technique and add it to your arsenal for various problems.
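To make the slow case concrete, here is a small sketch that counts how many character comparisons brute force performs. The class WorstCaseDemo and its helpers are hypothetical names used only for this illustration.

```java
final class WorstCaseDemo {
    // Runs a brute-force search and returns the number of character
    // comparisons made before the first match (or before giving up).
    static long countComparisons(String text, String pattern) {
        int m = text.length();
        int n = pattern.length();
        long comparisons = 0;
        for (int i = 0; i <= m - n; i++) {
            int j = 0;
            while (j < n) {
                comparisons++;
                if (text.charAt(i + j) != pattern.charAt(j)) {
                    break; // mismatch: move on to the next position
                }
                j++;
            }
            if (j == n) {
                break; // full match found, stop searching
            }
        }
        return comparisons;
    }

    // Builds a string made of `count` copies of the character c.
    static String repeat(char c, int count) {
        StringBuilder sb = new StringBuilder(count);
        for (int k = 0; k < count; k++) {
            sb.append(c);
        }
        return sb.toString();
    }
}
```

With a text of 1000 A's and a pattern of 99 A's followed by a B, each of the 901 candidate positions costs 100 comparisons, for 90100 in total. Against a text of 1000 B's the same pattern fails on the first character every time, costing only 901 comparisons.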
Approach 2: Rabin-Karp (Monte Carlo) Java Version
Calculate the hash of the pattern.
Calculate the hash of the first window in the text.
Repeat the steps below until the loop ends:
If hash(pattern) == hash(text), match the characters one by one, because equal hashes do not guarantee equal strings: the pure Monte Carlo variant skips this check and can therefore output an incorrect answer with a small probability. To further reduce the probability of collisions (or if your belief in probability theory is faint), the hash is taken as the remainder after division by a 31-bit prime number. A prime modulus spreads the hash values more evenly, which reduces the collision rate.
Subtract the leftmost character from hash(text).
Shift the entire hash(text) by one unit, so the window moves one position to the right.
Add the new character to the window.
Ultimately, the competitive advantage of Rabin-Karp is its ability to update the hash in constant time by reusing the previously calculated value (the rolling hash, from which the technique derives its name). A more straightforward but less efficient alternative is to recalculate the hash of the current window from scratch at each step, which makes the algorithm at least as costly as brute force.
Time: O(m + n) expected (O(m * n) worst case if every window needs character verification) Space: O(1)
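The steps above could look roughly like this in Java. This is a sketch under stated assumptions, not a definitive implementation: the class name RabinKarpSearch, the base R = 256 and the fixed modulus 2^31 - 1 (a 31-bit prime) are choices made for illustration. Because it verifies the characters on every hash match, as the steps describe, it always returns a correct answer; dropping that check would give the pure Monte Carlo behaviour.

```java
final class RabinKarpSearch {
    // Assumed parameters: a fixed 31-bit prime modulus and a base of 256.
    private static final long Q = 2147483647L; // 2^31 - 1
    private static final int R = 256;

    static int strStr(String haystack, String needle) {
        int m = haystack.length();
        int n = needle.length();
        if (n == 0) return 0;  // assumption: empty needle matches at index 0
        if (n > m) return -1;

        // rm = R^(n-1) mod Q: the weight of the leftmost character in a window.
        long rm = 1;
        for (int k = 1; k < n; k++) {
            rm = (rm * R) % Q;
        }

        long patHash = hash(needle, n);   // hash of the pattern
        long txtHash = hash(haystack, n); // hash of the first window

        for (int i = 0; i + n <= m; i++) {
            // On a hash match, confirm character by character to rule out a collision.
            if (patHash == txtHash && haystack.startsWith(needle, i)) {
                return i;
            }
            if (i + n < m) {
                // Subtract the leftmost character's contribution...
                txtHash = (txtHash + Q - (rm * haystack.charAt(i)) % Q) % Q;
                // ...then shift by one position and add the new character.
                txtHash = (txtHash * R + haystack.charAt(i + n)) % Q;
            }
        }
        return -1;
    }

    // Polynomial hash of the first len characters of s, modulo Q.
    private static long hash(String s, int len) {
        long h = 0;
        for (int k = 0; k < len; k++) {
            h = (h * R + s.charAt(k)) % Q;
        }
        return h;
    }
}
```

For example, RabinKarpSearch.strStr("mississippi", "issip") returns 4. Each window update touches only two characters, which is what keeps the per-step cost constant.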