Creating a custom search engine can be very difficult. The idea of writing the next Google or Bing is attractive, but it is probably harder than you think. If you are thinking about writing your own search engine, you should read this article by Anna Patterson.
There must be 4,000 programmers typing away in their basements trying to build the next “world’s most scalable” search engine. It has been done only a few times. It has never been done by a big group; always one to four people did the core work, and the big team came on to build the elaborations and the production infrastructure. Why is it so hard? We are going to delve a bit into the various issues to consider when writing a search engine. This article is aimed at those individuals or small groups that are considering this endeavor for their Web site or intranet. It is fun, but a word of caution: not only is it difficult, but you need two commodities in short supply—time and patience.
SUPER-SHORT SEARCH ENGINE OVERVIEW
OK, let’s do it. Let’s write a search engine.
A crawler gets the Web pages off of that pesky Web and onto your beautiful disks. You’ll need lots of disks.
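To make the crawling step concrete, here is a minimal sketch of a breadth-first crawler. It is an illustration under stated assumptions, not a production design: the names `crawl`, `LinkExtractor`, and `fake_web` are hypothetical, the `fetch` function is passed in so the example runs against an in-memory "Web" rather than the real one, and politeness concerns such as robots.txt and rate limiting are left out entirely.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at `seed`.

    `fetch` is any callable mapping a URL to its HTML, or None on failure.
    Returns a dict of URL -> HTML, standing in for "your beautiful disks".
    """
    seen = {seed}
    queue = deque([seed])
    store = {}
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip it
        store[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store


# Hypothetical three-page Web, held in memory for the demonstration.
fake_web = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": "no links here",
}
pages = crawl("http://example.com/", fake_web.get)
```

Swapping `fake_web.get` for a function that performs real HTTP requests is where the need for lots of disks starts to show.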
Then you need to index these pages—say which page has which words. This will tell you that Janet Jackson was found on the http://www.superbowl.com page. Usually, indexing happens locally on the disks where your crawler dumped these Web pages. Hey, why move them?
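"Which page has which words" is usually stored as an inverted index: a map from each word to the pages containing it. The sketch below is one minimal way to build such a map, assuming a deliberately naive tokenizer; the function names and the second sample page are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (naive on purpose)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the sorted list of page URLs that contain it."""
    postings = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            postings[word].add(url)
    return {word: sorted(urls) for word, urls in postings.items()}


# Hypothetical page contents for the demonstration.
pages = {
    "http://www.superbowl.com": "Janet Jackson performed at the Super Bowl halftime show",
    "http://example.com/music": "Janet Jackson released a new album",
}
index = build_index(pages)
```

Looking up `index["bowl"]` yields only the Super Bowl page, while `index["janet"]` yields both pages.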
In most architectures, now you need to merge these indices so that you have one place to go to in order to find all the pages mentioning Janet Jackson’s Super Bowl performance. When you merge all these small indices, the final index will be so big that it won’t fit on one machine. This means that you’ll have to merge these small indices in such a way as to split the final big index across many machines.
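One way to picture the merge-and-split step: combine the postings for each word across all the small indices, and assign each word to a machine by hashing it. This is a sketch that assumes partitioning by term; partitioning by document is another common choice, and the text above does not say which one to use. The names `shard_for` and `merge_and_split` are hypothetical, and each "machine" is just a dict in a list.

```python
import zlib
from collections import defaultdict


def shard_for(term, num_shards):
    """Pick a shard with a hash that is stable across runs (unlike built-in hash())."""
    return zlib.crc32(term.encode("utf-8")) % num_shards


def merge_and_split(small_indices, num_shards):
    """Merge per-crawler indices and split the result across `num_shards` shards.

    Each small index maps a term to a list of URLs. Every term ends up on
    exactly one shard, holding the union of its postings.
    """
    shards = [defaultdict(set) for _ in range(num_shards)]
    for index in small_indices:
        for term, urls in index.items():
            shards[shard_for(term, num_shards)][term].update(urls)
    return [{term: sorted(urls) for term, urls in shard.items()} for shard in shards]


# Two hypothetical small indices, as produced on two crawler disks.
index_a = {"janet": ["http://www.superbowl.com"], "bowl": ["http://www.superbowl.com"]}
index_b = {"janet": ["http://example.com/music"], "album": ["http://example.com/music"]}
shards = merge_and_split([index_a, index_b], num_shards=3)

# All pages mentioning "janet" now live in one known place.
janet_pages = shards[shard_for("janet", 3)]["janet"]
```

With real data the merge cannot hold everything in memory like this; it is typically done as a streaming merge of sorted files, which this sketch does not attempt.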
Now you are ready to serve queries? Wrong. Now you build the runtime system that gets users’ queries, retrieves the results out of the index from the right machine(s), and re-ranks them according to the query. All this, while people are drumming their fingers on their desks waiting—hopefully, lots of people and, hopefully, not enough time for much drumming.
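The runtime system can be sketched in the same spirit: route each query term to the shard that holds it, keep only the pages that contain every term, and order them by a score. The scoring here, a sum of per-term counts, is a placeholder assumption, not how any real engine ranks; `search`, `make_shards`, and the sample postings are hypothetical.

```python
import zlib


def shard_for(term, num_shards):
    """Route a term to its shard using a hash that is stable across runs."""
    return zlib.crc32(term.encode("utf-8")) % num_shards


def make_shards(postings, num_shards):
    """Distribute term -> {url: count} postings across in-memory shards."""
    shards = [{} for _ in range(num_shards)]
    for term, urls in postings.items():
        shards[shard_for(term, num_shards)][term] = urls
    return shards


def search(query, shards):
    """Return URLs containing every query term, best score first."""
    terms = query.lower().split()
    if not terms:
        return []
    scores = None
    for term in terms:
        found = shards[shard_for(term, len(shards))].get(term, {})
        if scores is None:
            scores = dict(found)
        else:
            # Keep only pages that also contain this term; add its count to the score.
            scores = {url: scores[url] + n for url, n in found.items() if url in scores}
    # Highest score first; ties broken by URL so the order is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical postings: term -> {url: number of occurrences on that page}.
postings = {
    "janet": {"http://www.superbowl.com": 3, "http://example.com/music": 5},
    "jackson": {"http://www.superbowl.com": 3, "http://example.com/music": 4},
    "bowl": {"http://www.superbowl.com": 7},
}
shards = make_shards(postings, num_shards=2)
results = search("Janet Jackson", shards)
```

On real hardware each shard lookup is a network call, so the lookups would be issued in parallel and given a deadline; that is where the finger drumming comes from.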