💡
Seb's Whiteboard
  • 👨‍💻Welcome, I'm Sebastien St Vil
  • Extras
    • Gradient Descent
    • How I learned Java
    • Machine Learning by Andrew Ng
  • Projects
    • 📉Backtest Equity Trading with SMA Strategy
    • Wellington GA Lab
    • Stock analysis
    • 📈Time Series Regression-based Trading Strategy
  • Arrays & Strings
    • Best Time to Buy and Sell Stock II
    • Online Stock Span
    • Implement strStr()
    • 2Sum
    • 3Sum
    • 3Sum Closest
    • 4Sum II
    • Set Matrix Zeroes
    • Group Anagrams
    • Longest Substring Without Repeating Characters
    • Remove Duplicates from Sorted Array
    • Move Zeroes
    • Valid Sudoku
    • Rotate Image
    • First Unique Character in a String
    • Design a Circular Queue
    • Longest Common Prefix
  • Binary Tree
  • Second Minimum Node In a Binary Tree (671)
  • Design
  • LRU Cache
  • Min Stack (155)
  • Sorting & Searching
    • Merge Sorted Array (88)
    • First Bad Version
  • Math
    • Power of Three (326)
    • Power of Two (231)
    • Count Prime (204)
    • Roman to Integer (13)
    • Fizz Buzz (412)
    • Count-and-Say
  • Dynamic Programming
    • Pascal's Triangle (118)
  • Linked List
    • Copy List with Random Pointer
    • Remove Nth Node From End of List
    • Remove Duplicated from Sorted List II
  • Tips
    • Finding dups
  • Sliding Window
    • Subarray Product Less Than K
Powered by GitBook
On this page
  • Approach1: Brute Force
  • Real Life applications:
  • Why the Improvements ?

Was this helpful?

  1. Arrays & Strings

Implement strStr()

Implement strStr().Return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack

Input: text = "aaaaa", pattern = "bba"
Output: -1

Approach1: Brute Force

When tackling problems of this sort, an obvious method for doing a substring search would be to check, for each possible position in the text at which the pattern could match, whether it does indeed match.

  • We will keep one pointer (i) on the text and another pointer (j) into the pattern.

  • For each i, j resets to 0 (starts fresh to look for pattern) and gets incremented until it finds a mismatch or utlimately finds a pattern (j == n)

  • if we reach the end of the text (i == m - n + 1), we can return -1 thereby indicating we have not found the specified pattern within the text.

  • The reason the loop terminates at i == m - n + 1 opposed to i == m is due to the fact that if i has n - k elements left in the loop where k > 0 therefore it is mathematically and practically imopossible to identify a pattern at that point that would be of size n or greater.

**

class Solution(object):
    def strStr(self, haystack: str, needle: str) -> int:
        if needle == "":
            return 0
        m,n = len(haystack), len(needle)
        
        for i in range(m - n + 1):
            for j in range(n):
                if haystack[i + j] != needle[j]:
                    break
                if j == n - 1:
                    return i
        return -1

Time: O(m * n) Space: O(1)

Real Life applications:

  • Plagiarism Detection: I'm sure that we all remember that back in college, some professors wanted our essays to be emailed as well as printed out. It so happens that our essays were being compared and decomposed into string tokens and compared using string matching algorithms.(They probably implement some of the most optimized versions as mentionned below)

  • Honorable mentions: Bioinformatics and DNA Sequencing, Spam Filters, Spelling Checker

Why the Improvements ?

Let's say we're hypothetically dealing with an immense array of texts that contains long string of As. If the pattern that we want to identify also begin with long strings of A's, then finding the pattern will be slow. In that scenario, we probably would want to implement a faster algorithm. (Thankfully) We do not have to start the research ourselves as over the years, brilliant individuals have come up with different realizations that we can tap into in order to optimize our implementation. The different variations would be KMP, Boyer-Moore, Rabin-Karp and lastly the Z-algorithm.

Each of these algorithms presents their own strength and weaknesses based of the nature of the data they are dealing with but they will be nonetheless faster than the Brute-force.

Frankly, I find KMP the one I most often remember however I'd consider the Rabin-Karp, most pragmatic as you can tune it to your needs or extract it's underlying technique and use as an arsenal for various problems.

Approach2: Rabin-Karp (Monte Carlo) Java Version

  • Calculate Hash for pattern

  • Calculate Hash for 1st Windown in tent

  • Repeat the below until loops ends:

    • if hash(pattern) == hash(text) then match characters one by one because this algorithm can output a correct answer with a small probability(Monte Carlo). To further prevent the probability of collisions(or if your belief in probability theory is faint), we take the remainder of the output after being divided by a 31 bit prime number. Prime number is used to spread the distribution of hash output consequently reducing rate of collisions.

    • Substract leftmost from hash(text)

    • Shift entire hash(text) by unit to the right

    • Add new Character to the window

    Ultimately, we should retain that the competitive advantage of Rabin Karp is it's ability to reuse already calculated hash value in constant time (from which it derives it's name). A more straightforward but less efficent alternative is to recalculate the hash function of the characters that we're pointing to at each stop thereby rendering the algorithm more costly than the Brute-force.

import java.util.*;
import java.math.*;

class Solution{
 private final static long Q = primeNumber();
 private final static int R = 256;
 long RM = 1;
 
 //Rabin-Karp algorithm
 public int strStr(String haystack, String needle){
  int n = haystack.length();
  int m = needle.length();
  
  for(int i = 1; i <= m - 1; i++){
   RM = (RM * R) % Q;
   }
   
 if(m > n ) return - 1;
 
    long textHash = hash(haystack, n);
    long patternHash = hash(needle, m);
    
    if((textHash == patternHash) && check(haystack, needle, 0)){
       return 0;
       }
       
   for(int i = m; i < n; i++){
     textHash = (textHash + Q  - RM * haystack.charAt(i - m) % Q) % Q;
     textHash = (textHash * R + haystack.charAt(i) % Q);
     int offset = i - m + 1;
     if((textHash == patternHash) && check(haystack, needle, offset)){
       return offset;
         }
        }
        return n; 
      }
      
   private boolean check(String hayStack, String needle, int i){
     for(int j = 0; j < needle.length(); j++)
       if(needle.charAt(j) != hayStack.charAt(i + j))return false;
     return true;
     }
     
   private long hash(String t, int m){
       long h = 0;
       for(int i = 0; i < m; i++){
         h = (h * R + t.charAt(i)) % Q;
         }
       return h;
       }
       
   private static long primeNumber(){
       return BigInteger.probablePrime(31, new Random()).longValue();
   }

Time: O(m + n) Space: O(1)

PreviousOnline Stock SpanNext2Sum

Last updated 3 years ago

Was this helpful?

Text Mining: Tasks designed to extract previously unknown information by analyzing large quantities of text, as well as tasks geared toward the retrieval of textual data from a large array of documents. These mining techniques (Text categorization/ classification, and text clustering) in turn use some form of pattern matching

I visualize the implementation of this algorithm as a ball that we are rolling over the characters of the texts. If the pattern is of size n then the ball can only covers n size of texts as well as it rolls over(similar concept to the ). We create a hash function in order to get a relative values of the Characters covered by the ball. If hash(text) at that point equals hash(pattern), we verify that each characters are in fact equal to each other. I've outlined the steps below.

(Sebastian, 2002; Fan et al., 2006; Sullivan, 2001).
sliding-window technique