Algorithm for multiple word matching in text

I have a large set of words (about 10,000) and I need to find if any of those words appear in a given block of text.

Is there a faster algorithm than doing a simple text search for each of the words in the block of text?

Multiple events matching algorithm

I have a task to match multiple events(facts) with each other by some their properties. As a result of events matching some action should be generated. Action can be generated when events of all exist

Algorithm for word matching [closed]

My main idea is to find an algorithm ( Java ) that takes the random letters which someone has typed in a JoptionPane for instance and then instantly by pressing Find words i would like the program t

Find First Word matching from Given Text – Regex

I want to find First Word matching from Given Text and replace with another word, using Regex. Consider following string as an Example Text Which type is your item? i suppose that the item isn’t a st

Matching multiple keywords in a text (Javascript)

Here is my code: // Matching multiple keywords in a text // Case 1 var text = Hello, My name is @Steve, I love @Bill, happy new year!; var keywords = [steve]; var matching = text.toLowerCase().sea

Multiple short rules pattern matching algorithm

As the title advances, we would like to get some advice on the fastest algorithm available for pattern matching with the following constrains: Long dictionary: 256 Short but not fixed length rules (fr

Multiple text comparison algorithm

Given multiple texts that are slightly different from one another (some words missing / replaced by other words), is there a good algorithm to create some sort of template out of them? For example:

Best word wrap algorithm?

Word wrap is one of must-have features in modern text editor. Do you know how to handle word wrap? What is the best algorithm for word-wrap? updated: If text is several million lines, how can I make w

Regexp, matching a word multiple times

This baffles me, though i suppose the problem is quite simple. Given this piece of code: var text = ‘ url(https://) repeat scroll 0% 0%, url(https://) repeat scroll 0% 0%, url(https://) repeat s

Algorithm for matching partially filled words

I am writing a game which when given a partially filled word, searches a dictionary and returns all the matching words. To that effect, I am trying to find an algorithm that can be used for the said p

Need string matching algorithm

Data: x.txt,simple text file(around 1 MB) y.txt dictionary file(around 1Lakh words). Need to find whether any of the word/s in y.txt is present in x.txt. Need an algorithm which consumes less time fo

Answers

Put the 10,000 words into a hash table, then check each word in the block of text to see whether it has an entry.

I don't know whether it is faster; it is just another method (it would depend on how many words you are searching for).

A simple Perl example:

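use strict;
use warnings;

my $word_block = "the guy went afk after being popped by a brownrabbit";
my %hash = ();
# Split the text on runs of whitespace.
my @words = split /\s+/, $word_block;
# Load the word list (one word per line, read from __DATA__) into the hash.
while (<DATA>) { chomp; $hash{$_} = 1; }
foreach my $word (@words)
{
    print "found word: $word\n" if exists $hash{$word};
}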

__DATA__
afk
lol
brownrabbit
popped
garbage
trash
sitdown

The answer heavily depends on the actual requirements.

  1. How large is the word list?
  2. How large is the text block?
  3. How many text blocks must be processed?
  4. How often must each text block be processed?
  5. Do the text blocks or the word list change? If so, how frequently?

Assuming the text blocks are relatively small compared to the word list and each text block is processed only once, I suggest putting the words from the word list into a hash table. Then you can perform a hash lookup for each word in the text block and find out whether the word list contains it.

If you have to process the text blocks multiple times, I suggest inverting the text blocks. Inverting means creating, for each word, a list of all the text blocks that contain that word.
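A minimal Python sketch of such an inverted index follows. The function names, the sample blocks, and the tokenization (lowercasing and splitting on non-word characters) are assumptions made for illustration; they are not part of the answer above.

```python
import re
from collections import defaultdict


def build_inverted_index(blocks):
    """Map each lowercase word to the set of block ids that contain it."""
    index = defaultdict(set)
    for block_id, text in enumerate(blocks):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(block_id)
    return index


def blocks_containing_any(index, words):
    """Return the ids of all blocks that contain at least one of the words."""
    hits = set()
    for word in words:
        # .get avoids creating empty entries for words that never occur.
        hits |= index.get(word.lower(), set())
    return hits


# Hypothetical sample data.
blocks = ["the guy went afk", "nothing to see here", "popped by a brownrabbit"]
index = build_inverted_index(blocks)
matches = blocks_containing_any(index, ["afk", "popped", "lol"])
```

The index is built once; after that, each query costs one lookup per word in the word list, regardless of how long the text blocks are.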

In still other situations it might be helpful to generate a bit vector for each text block with one bit per word indicating if the word is contained in the text block.
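One way to sketch the bit-vector idea in Python is to use an arbitrary-precision integer as the vector. The names `build_word_bits` and `block_signature` are hypothetical, and the tokenization is an assumption.

```python
import re


def build_word_bits(word_list):
    """Assign each distinct word a bit position, in first-seen order."""
    return {word: i for i, word in enumerate(dict.fromkeys(word_list))}


def block_signature(text, word_bits):
    """Return an int whose bit i is set if word number i occurs in text."""
    signature = 0
    for token in re.findall(r"\w+", text.lower()):
        bit = word_bits.get(token)
        if bit is not None:
            signature |= 1 << bit
    return signature


word_bits = build_word_bits(["afk", "lol", "brownrabbit", "popped"])
signature = block_signature(
    "the guy went afk after being popped by a brownrabbit", word_bits
)

# Testing for a single word is then one bitwise AND.
has_lol = bool(signature & (1 << word_bits["lol"]))
has_afk = bool(signature & (1 << word_bits["afk"]))
```

Once the signatures are stored, checking whether a block contains any word from a subset of the list is a single AND against a mask built from that subset.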

Try out the Aho-Corasick algorithm: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

Build up a trie of your words, and then use that to find which words are in the text.
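As a rough illustration, here is a compact Python sketch of the Aho-Corasick idea: a trie of the words plus failure links, so the text is scanned once. It is a sketch under assumptions, not a tuned implementation; the table layout and function names are chosen for this example. Note that it reports substring matches, so whole-word matching would need an extra boundary check.

```python
from collections import deque


def build_automaton(words):
    """Build goto transitions, failure links and output sets."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state for the longest proper suffix
    out = [set()]  # out[state] -> words that end at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_words(text, automaton):
    """Return the set of words that occur anywhere in text."""
    goto, fail, out = automaton
    state = 0
    found = set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


automaton = build_automaton(["afk", "lol", "brownrabbit", "popped", "garbage"])
found = find_words("the guy went afk after being popped by a brownrabbit", automaton)
```

The scan itself does an amount of work proportional to the length of the text (plus the matches reported), independent of how many words are in the list.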

You can build a graph and use it as a state machine: when you process the i-th character of your input word, Ci, you try to move to the i-th level of the graph by checking whether the node you reached for Ci-1 has a child node labeled Ci.

For example, if you have the following words in your corpus
(“art”, “are”, “be”, “bee”)
you will have the following nodes in your graph
n11 = ‘a’
n21 = ‘r’
n11.sons = (n21)
n31 = ‘t’
n32 = ‘e’
n21.sons = (n31, n32)
n41 = ‘art’ (here we have a leaf in our graph, and the word built from all the upper nodes is associated with this node)
n31.sons = (n41)
n42 = ‘are’ (here again we have a word)
n32.sons = (n42)
n12 = ‘b’
n22 = ‘e’
n12.sons = (n22)
n33 = ‘e’
n34 = ‘be’ (word)
n22.sons = (n33,n34)
n43 = ‘bee’ (word)
n33.sons = (n43)

During the process, if you reach a leaf while processing the last character of your input word, and only in that case, your input is in your corpus.

This method is more complicated to implement than a single Dictionary or Hashtable, but it will be much more efficient in terms of memory use.
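The walk described above can be sketched in Python with nested dictionaries. This is one possible representation, not the exact node layout listed in the example: instead of a separate leaf node per word, a hypothetical `END` marker key is stored on the node where a word finishes.

```python
END = "$end"  # hypothetical marker key; cannot clash with single characters


def build_trie(words):
    """Build a trie as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True  # a word ends at this node
    return root


def contains(trie, word):
    """Follow one child per character; succeed only on a word-ending node."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


trie = build_trie(["art", "are", "be", "bee"])
in_corpus = [w for w in ["art", "ar", "be", "bee", "bees"] if contains(trie, w)]
```

Here "ar" is rejected because the walk ends on a node that carries no word marker, and "bees" is rejected because the node for "bee" has no child for "s".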

The Boyer-Moore string search algorithm should work. Depending on the size or number of words in the block of text, you might want to use it as the key to search the word list (are there more words in the list than in the block?). Also, you probably want to remove any duplicates from both lists.