Home

Advertisement

Customize

finding algorithms

Dear lazyweb, I recently found myself having a set of Unicode strings (in main memory) and the desire to search all of them for a particular prefix (not substring). For bonus points, let's say I want to do it in parallel (seems like an easily parallelizable problem, although at worst you can divide-and-conquer for a given number N of processors). The constraints are that the set of strings can change, but not very often usually. I don't want too large of a hit to compute tables and the like though.


The real question is - how do you find papers (and implementations thereof) for computer science problems like this? My first instinct was to get an ACM subscription, which I did. Their search turns out to be mediocre, and then I discovered Wikipedia has useful pages on these kinds of things. Wikipedia also has the winning feature of linking to quality free software implementations thereof.


On this particular problem, a lot of the literature is around substring matching which isn't what I want, at least not directly. Still looking.

Comments

I generally run Google over CiteSeer (their own search engine is crap) and/or use the electronic database search at my uni.

If you don't have access to a university/college library database, then email your own/you local university/college's computer science department. They are generally happy to help.

/Mike

some sort of self balancing tree, like a splay tree

I think this (or at least a good starting point) should be in any textbook used in a graduate level algorithms class. I am reasonably sure ours had an algorithm for that and it had a funny name, but that's all I remember, of course.
E.g. MIT's class (http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-854JFall-2005/CourseHome/) references this book:
http://www.amazon.com/exec/obidos/ASIN/0521585198/ref=nosim/mitopencourse-20
(and yes, at this point anything that has to do with advanced string searching will end up with genetics-related stuff).

Re: some sort of self balancing tree, like a splay tree

That's a good suggestion - from the MIT class page I found Suffix trees which look useful. I should unpack my school textbooks too..

(Anonymous)

Re: some sort of self balancing tree, like a splay tree

There are 3 options:

1) An array. Binary search is cool. Depends on what it would take to sort your strings (how they change, where they change) but you could use gap allocation to save a lot of speed if its the right data set.

2) A radix tree. One of my favorite data structures. Could save you space as well as speed.

3) Deterministic Skip List. Baddest datastructure of all time. Logical equivalent to a 2-3-4 B-Tree, but 1/4 the implementation effort. If I don't assume anything else other than what you gave me about this problem, this one wins hands down.

Ping me on IRC about this if you want.

Casey Dahlin

(Anonymous)

For this particular problem it seems the fastest (in terms of asymptotic complexity) solution is to build a trie of all the strings (that's O(total length of strings). Then you can simply walk the trie using the prefix (that's O(length of prefix)), and collect all strings in the trie subtree (that's O(total length of resulting strings) at worst, and might be O(number of resulting strings) if the strings are sorted).

The only potentially parallelizable parts are setting up the trie and collecting strings in a trie subtree.

Mirek

Finding algorithms

(Anonymous)

Knuth and Cormen

Hi

Half of volume 3 in 'The art of computer programming' is about searching. There are interesting search and data-structure chapters (sequential, radix-tree, Knuth-Morris-Pratt etc) in 'Introduction to algorithms', too.

Re: Knuth and Cormen

I only have Vol 1, I should get the others. But the problem with Knuth is that it predates Unicode - for string algorithms in particular Unicode does matter.

(Anonymous)

Re: Knuth and Cormen

Hm, yes, you're right...

How about this article - collators for language-sensitive sorting and searching

http://www.icu-project.org/docs/papers/efficient_text_searching_in_java.html

Re: Knuth and Cormen

Use UTF-8, it's designed in a way that dumb octet-based substring searching is safe assuming valid input (and it can be validated first if that's your concern, that should be done anyway if it's security-sensitive in any way). If you use algorithms which don't make any assumptions about there being a low number of possible characters, UTF-16 or even UTF-32 should also be safe. (UTF-16 uses separate first and second surrogates exactly so that dumb bi-octet substring searching is safe.)

(Anonymous)

Citeseer

As others have pointed out, the best thing for this kind of problem is Knuth, but for general computer science paper research I heavily use CiteSeer (http://citeseer.ist.psu.edu/)

trie common prefix

trie was the first thing that popped into my head. A quick search looks promising:
http://www.google.ie/search?q=trie+common+prefix

Advertisement

Customize