Implement strStr()
Implement strStr().Return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack
Input: text = "aaaaa", pattern = "bba"
Output: -1
Approach1: Brute Force
When tackling problems of this sort, an obvious method for doing a substring search would be to check, for each possible position in the text at which the pattern could match, whether it does indeed match.
We will keep one
pointer (i)
on the text and anotherpointer (j)
into the pattern.For each
i, j
resets to0
(starts fresh to look for pattern) and gets incremented until it finds a mismatch or utlimately finds a pattern(j == n)
if we reach the end of the text
(i == m - n + 1)
, we can return-1
thereby indicating we have not found the specified pattern within the text.The reason the loop terminates at
i == m - n + 1
opposed toi == m
is due to the fact that ifi
hasn - k
elements left in the loop wherek > 0
therefore it is mathematically and practically imopossible to identify a pattern at that point that would be of size n or greater.
**
class Solution(object):
def strStr(self, haystack: str, needle: str) -> int:
if needle == "":
return 0
m,n = len(haystack), len(needle)
for i in range(m - n + 1):
for j in range(n):
if haystack[i + j] != needle[j]:
break
if j == n - 1:
return i
return -1
Time: O(m * n) Space: O(1)
Real Life applications:
Text Mining: Tasks designed to extract previously unknown information by analyzing large quantities of text, as well as tasks geared toward the retrieval of textual data from a large array of documents. These mining techniques (Text categorization/ classification, and text clustering) in turn use some form of pattern matching (Sebastian, 2002; Fan et al., 2006; Sullivan, 2001).
Plagiarism Detection: I'm sure that we all remember that back in college, some professors wanted our essays to be emailed as well as printed out. It so happens that our essays were being compared and decomposed into string tokens and compared using string matching algorithms.(They probably implement some of the most optimized versions as mentionned below)
Honorable mentions: Bioinformatics and DNA Sequencing, Spam Filters, Spelling Checker
Why the Improvements ?
Let's say we're hypothetically dealing with an immense array of texts that contains long string of As. If the pattern that we want to identify also begin with long strings of A's, then finding the pattern will be slow. In that scenario, we probably would want to implement a faster algorithm. (Thankfully) We do not have to start the research ourselves as over the years, brilliant individuals have come up with different realizations that we can tap into in order to optimize our implementation. The different variations would be KMP, Boyer-Moore, Rabin-Karp and lastly the Z-algorithm.
Each of these algorithms presents their own strength and weaknesses based of the nature of the data they are dealing with but they will be nonetheless faster than the Brute-force.
Frankly, I find KMP the one I most often remember however I'd consider the Rabin-Karp, most pragmatic as you can tune it to your needs or extract it's underlying technique and use as an arsenal for various problems.
Approach2: Rabin-Karp (Monte Carlo) Java Version
I visualize the implementation of this algorithm as a ball that we are rolling over the characters of the texts. If the pattern is of size n then the ball can only covers n
size of texts as well as it rolls over(similar concept to the sliding-window technique). We create a hash function in order to get a relative values of the Characters covered by the ball. If hash(text)
at that point equals hash(pattern)
, we verify that each characters are in fact equal to each other. I've outlined the steps below.
Calculate Hash for pattern
Calculate Hash for 1st Windown in tent
Repeat the below until loops ends:
if
hash(pattern) == hash(text)
then match characters one by one because this algorithm can output a correct answer with a small probability(Monte Carlo). To further prevent the probability of collisions(or if your belief in probability theory is faint), we take the remainder of the output after being divided by a 31 bit prime number. Prime number is used to spread the distribution of hash output consequently reducing rate of collisions.Substract leftmost from
hash(text)
Shift entire
hash(text)
by unit to the rightAdd new Character to the window
Ultimately, we should retain that the competitive advantage of Rabin Karp is it's ability to reuse already calculated hash value in constant time (from which it derives it's name). A more straightforward but less efficent alternative is to recalculate the hash function of the characters that we're pointing to at each stop thereby rendering the algorithm more costly than the Brute-force.
import java.util.*;
import java.math.*;
class Solution{
private final static long Q = primeNumber();
private final static int R = 256;
long RM = 1;
//Rabin-Karp algorithm
public int strStr(String haystack, String needle){
int n = haystack.length();
int m = needle.length();
for(int i = 1; i <= m - 1; i++){
RM = (RM * R) % Q;
}
if(m > n ) return - 1;
long textHash = hash(haystack, n);
long patternHash = hash(needle, m);
if((textHash == patternHash) && check(haystack, needle, 0)){
return 0;
}
for(int i = m; i < n; i++){
textHash = (textHash + Q - RM * haystack.charAt(i - m) % Q) % Q;
textHash = (textHash * R + haystack.charAt(i) % Q);
int offset = i - m + 1;
if((textHash == patternHash) && check(haystack, needle, offset)){
return offset;
}
}
return n;
}
private boolean check(String hayStack, String needle, int i){
for(int j = 0; j < needle.length(); j++)
if(needle.charAt(j) != hayStack.charAt(i + j))return false;
return true;
}
private long hash(String t, int m){
long h = 0;
for(int i = 0; i < m; i++){
h = (h * R + t.charAt(i)) % Q;
}
return h;
}
private static long primeNumber(){
return BigInteger.probablePrime(31, new Random()).longValue();
}
Time: O(m + n) Space: O(1)
Last updated
Was this helpful?