Keyword Matching

Specifications and Algorithms

 

Searching the list of files for matches to a query is one of the expensive parts of LimeWire.  As of this writing (12/12), it consumes about 87% of the CPU time.  A number of improvements are possible, some resulting in weaker policies.  These are described below in terms of the difficulty of implementation.   In all of the following algorithms, case is ignored.  I’ll also ignore wildcards; because wildcard searches are so rare, they can be special cased.

 

Let WORDS(s) be all the words of a string, tokenized by spaces, +’s, etc.  Example: WORDS(“beatles yellow+submarine”) is the set {“beatles”, “yellow”, “submarine”}.  I will make much use of this helper function.

 

1.    Keyword Substring Matching

 

This is the strongest policy.  A query Q matches a file name F if all strings in WORDS(Q) are a substring of F.  For example, “ello eatle” will match “beatles yellow+submarine” since both “ello” and “eatle” are substrings of “beatles yellow+submarine”. 

 

This is the current policy.  It is implemented in a straightforward manner, by searching each file name in a list.  The problem is that the time complexity is proportional to the number of files and the length of those file names.  If you have 20K files, that’s a problem.  At the end of this document, I give a faster algorithm.

2.    Keyword Subset Matching

 

This is the weakest policy.  A query Q matches a file name F if WORDS(Q) is a subset of WORDS(F).  For example “submarine beatles” will match the above file name, but “sub beatles” will not.

 

This can be implemented efficiently by indexing each file name and storing the keywords in a hash table.  Then matching can be done in constant time.  The space required is only proportional to the sum of the lengths of all filenames.  Space can be compressed further by storing the keywords in a Trie, a tree-like structure discussed below.

3.    Keyword Prefix Matching

 

A query Q matches a file name F if all strings in WORDS(Q) are a prefix of some element of WORDS(F).  For example, “sub beatle” will match the above file name, but “sub eatles” will not.

 

This is best implemented by indexing each file name and storing the keywords in a Trie.  This is a tree-like data structure that takes advantage of words with common prefixes.  For example, the files “funny mp3 and funny fund” have the keywords “funny”, “mp3”, and “fund”.  These would be stored in the Trie as

 

fun

   ny

   d

mp3

 

Searching then proceeds in linear time with respect to the length of the query.

4.    Faster Keyword Substring Matching

 

The same policy described in (1) can be implemented in the same time as (3) by using a Suffix Tree, but at the cost of more space.  Assume we have the files “cat-mp3” and “the-cat”.  To naively compute the suffix tree, we first compute all the suffixes of each file name:

 

3

p3

mp3

-mp3

t-mp3

at-mp3

cat-mp3

t

at

cat

-cat

e-cat

he-cat

the-cat

 

Then we sort them:

 

-cat

-mp3

3

at

at-mp3

cat

cat-mp3

e-cat

he-cat

mp3

p3

t

t-mp3

the-cat

 

Finally we observe that many of these suffixes have common prefixes.  Taking advantage of this, we compress them into a Trie:

 

-

  cat

  mp3

3

at

  -mp3

cat

  -mp3

e-cat

he-cat

mp3

p3

t

 -mp3

the-ca

 

Now let W be an element of WORDS(Q).  Note that W is a substring of F iff W is a prefix of some substring of W.  Using this idea, we simply match W against the Trie above.  This takes time proportional to the length of W.

 

The problem is that the suffix tree can be quite large, since representing each file in the tree may take space quadratic with respect to the length of each file name.  If many files have overlapping prefixes, e.g. “Funny video” and “Funny movie”, this will be ameliorated somewhat.  Furthermore, if we weaken the specification slightly, we can reduce the space requirements by only computing the suffix tree on the keywords in each filename, not the full filename.

 

 

Christopher Rohrs

12/12/2000