With all the venture capital funding that went into search engines such as Lycos, AltaVista, Yahoo and Google, we might think that developing a great search engine is beyond the scope of mere mortals. While such a task can be optimized endlessly, the basics of writing a reasonably fast search engine are straightforward. There are three problems involved:
Collecting URLs to Web Pages and Other Online Documents

The easiest way to obtain a list of URLs is to ask the public to submit their domain name to your search engine. You will need to write some code that can read a publicly available file from any server. The algorithm for collecting all of the pages under a submitted domain is simple: you read the default file for that domain, such as index.htm or index.html, scan the <a href> tags within that page, and then repeat the process for every referenced file. You need to identify whether each hyperlink is a relative link (no 'http://' at the start) or an absolute link; if you are developing a search engine designed to search only one domain, you can simply ignore all absolute links that point to another domain. Remember that the task at this stage is simply to collect a list of absolute file references. For example, someone may submit the domain www.mydomain.com. Your search engine would then read www.mydomain.com/index.html and find references to www.mydomain.com/another.html and www.digitalscores.com/default.html. At that stage the URL list would look like:

    www.mydomain.com/index.html
    www.mydomain.com/another.html
    www.digitalscores.com/default.html
For the second iteration your search engine would scan the iteration 1 files in the same way and append any further references it finds to the list.
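To make the collection step concrete, here is a minimal sketch of such a crawler written in Python using only the standard library. The starting URL, the fixed iteration depth and the single-domain restriction are simply illustrative choices, not part of any particular search engine's implementation:

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href value of every <a> tag found in a page."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def collect_urls(start_url, max_depth=2):
        """Breadth-first collection of absolute URLs under one domain.

        Relative links are converted to absolute references with urljoin;
        absolute links that point to another domain are ignored, as for a
        search engine that only searches a single domain.
        """
        domain = urlparse(start_url).netloc
        seen = {start_url}
        frontier = [start_url]
        for _ in range(max_depth):
            next_frontier = []
            for url in frontier:
                try:
                    html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
                except OSError:
                    continue                        # unreachable page: skip it
                parser = LinkExtractor()
                parser.feed(html)
                for href in parser.links:
                    absolute = urljoin(url, href)   # resolve relative links
                    if urlparse(absolute).netloc != domain:
                        continue                    # ignore other domains
                    if absolute not in seen:
                        seen.add(absolute)
                        next_frontier.append(absolute)
            frontier = next_frontier
        return sorted(seen)

    # Hypothetical submitted domain, crawled two hyperlinks deep.
    for page in collect_urls("http://www.mydomain.com/index.html", max_depth=2):
        print(page)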
We have only gone two hyperlinks deep; you can decide what iteration depth to use, or alternatively you can iterate indefinitely if you are creating the type of search engine designed to continually scan the internet for newly created websites.

Indexing Each Page

You now have a long list of pages, all converted to absolute references. We now need to do some heavier work that later allows each of these pages to be searched quickly. In this step we do not care very much about how slow the process is, because it only needs to be done once for every new page that is added, while that page may be searched thousands of times and only the search itself needs to be optimized for performance. In organizing our data for fast searches the first step is to give each unique page an identification number:

    www.mydomain.com/index.html      1
    www.mydomain.com/another.html    2
We now go through every word in every page, ignoring HTML tags, and create an index between words and pages so that all of the pages containing a given word can be looked up immediately. For example, if www.mydomain.com/index.html contained only the text "This is my website" and www.mydomain.com/another.html contained only "This one is better", then after adding just these two pages our index would look like:

    this     1, 2
    is       1, 2
    my       1
    website  1
    one      2
    better   2
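As an illustration of this indexing step, the following small Python sketch assigns each page an identification number, ignores HTML tags and builds the word-to-page index in plain dictionaries (a real engine would of course use the database storage discussed next):

    import re

    def build_index(pages):
        """pages maps URL -> raw HTML; returns (page_ids, inverted_index).

        page_ids maps each URL to its identification number, and
        inverted_index maps each lowercased word to the set of page IDs
        whose text contains that word.
        """
        page_ids = {}
        inverted_index = {}
        for url, html in pages.items():
            page_id = page_ids.setdefault(url, len(page_ids) + 1)
            text = re.sub(r"<[^>]+>", " ", html)           # ignore HTML tags
            for word in re.findall(r"[a-z0-9']+", text.lower()):
                inverted_index.setdefault(word, set()).add(page_id)
        return page_ids, inverted_index

    # The two example pages used in the text:
    pages = {
        "www.mydomain.com/index.html": "This is my website",
        "www.mydomain.com/another.html": "This one is better",
    }
    ids, index = build_index(pages)
    print(ids)    # {'www.mydomain.com/index.html': 1, 'www.mydomain.com/another.html': 2}
    print(index)  # {'this': {1, 2}, 'is': {1, 2}, 'my': {1}, 'website': {1}, ...}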
We also need to store at least some summary information about each page in our database so that we can display it when the page is returned. You might want to store just the title and the first one hundred words of text. Google goes as far as storing the entire page content, and this is not as far-fetched as it may at first sound: the amount of information in all of the pages combined will always be significantly less than the amount of data we have after organizing the pages as indexed data using the above technique. A good search engine will not only record which pages are pointed to by each individual unique word but will also record the positions within each page that the word points to. This involves very much more data, because a word such as 'also' may appear more than ten times in a page and each position needs to be recorded. As our needs increase it is time to move to a relational database format for storing our data.

Representing Page Indexes Using a Relational Database

You can use a relational database that supports indexing to store the word-to-page references. This will actually reduce the size of the data by storing each page number as a 32-bit integer rather than as a text number, as we did in the previous section, while the data will be organized in more or less the same way internally if the wordID column is indexed. Consider the following design:
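The original table layout is not reproduced here, so what follows is one possible sketch of the design using SQLite from Python. The table names, the HTMLContent column, the incoming link count, the indexOrder column and the position-of-first-reference column are all taken from the surrounding discussion; the remaining column names and types are assumptions made for illustration:

    import sqlite3

    conn = sqlite3.connect("searchengine.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS SE_Word (
        wordID   INTEGER PRIMARY KEY,
        wordText TEXT UNIQUE
    );

    CREATE TABLE IF NOT EXISTS SE_Page (
        pageID            INTEGER PRIMARY KEY,
        url               TEXT UNIQUE,
        title             TEXT,
        HTMLContent       TEXT,               -- full cached page content
        incomingLinkCount INTEGER DEFAULT 0   -- updated as more pages are processed
    );

    CREATE TABLE IF NOT EXISTS SE_WordToPageReference (
        wordID                             INTEGER,
        pageID                             INTEGER,
        indexOrder                         INTEGER,  -- ordering used when returning pages
        positionOfFirstReferenceWithinText INTEGER
    );

    -- The main index: look-ups by wordID must be extremely fast.
    CREATE INDEX IF NOT EXISTS idx_word_ref ON SE_WordToPageReference (wordID);
    -- A secondary index to support ORDER BY indexOrder.
    CREATE INDEX IF NOT EXISTS idx_order_ref ON SE_WordToPageReference (indexOrder);
    """)
    conn.commit()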
This database design essentially links each individual word to a long list of HTML pages, as we did in the previous section. The defining feature is that the wordID column of the SE_WordToPageReference join table is marked as the main index. It is necessary that the database stores the information in this join table in such a way that select statements searched by wordID are extremely fast. In this design we also store the entire HTML content of each page (the HTMLContent column of SE_Page) and keep a count of how many incoming links each page has, which grows as more pages are processed over time. The design also supports page importance in such a way that we can do very fast searches without needing to join tables within select statements. In particular, a page search using the word 'tiger' can be performed using:
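Continuing the SQLite sketch above, such a search might look like the following, where the word is first translated into its wordID and the top ten references are then read in indexOrder:

    def search_single_word(conn, word, limit=10):
        """Return up to `limit` (pageID, positionOfFirstReferenceWithinText)
        pairs for one word, best indexOrder first."""
        row = conn.execute(
            "SELECT wordID FROM SE_Word WHERE wordText = ?", (word.lower(),)
        ).fetchone()
        if row is None:
            return []    # unknown word: the search ends immediately
        return conn.execute(
            """SELECT pageID, positionOfFirstReferenceWithinText
               FROM SE_WordToPageReference
               WHERE wordID = ?
               ORDER BY indexOrder
               LIMIT ?""",
            (row[0], limit),
        ).fetchall()

    # results = search_single_word(conn, "tiger")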
This will bring up the 10 most relevant documents containing the word 'tiger', as well as the position of one reference to this text within each document. This allows the relevant part of each document to be reproduced and displayed immediately without having to perform an extra search through the document. You might be wondering whether we should be able to retrieve all the positions of the word within the document. The problem with this is that it requires us to store much more data and does not make the search engine any faster. The positions of the searched word within the document only need to be examined when the page is viewed in full, and since the whole document has to be loaded at that point, the searched words can also be highlighted without any performance detriment. The column indexOrder in SE_WordToPageReference has been marked as a secondary index. This should greatly improve performance for databases supporting secondary indexing because of the "ORDER BY indexOrder" part of our select statement. Moreover, we are asking the select statement to terminate after only the top ten results have been found. This type of select statement will be very fast if the ORDER BY clause uses an indexed column; however, it is important to understand that wordID should still be the primary index to satisfy the fundamental design discussed in the previous section.
Calculating the Order In Which Pages Are Displayed

The indexOrder column was added to the SE_WordToPageReference table for the single reason of providing a sensible order in which pages are returned, as demonstrated by the select statement in the previous section.
In this discussion we concentrate on how the indexOrder can be calculated for each word-to-page reference. This justifies going into some detail about how the search engine actually generates the data for all of the tables. We will consider that we have a reference to an online HTML page, either submitted by the public or found as part of an ongoing scanning process, and we wish to update the database to incorporate this page.
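The page-processing algorithm itself is not spelled out in detail here, so the following Python sketch shows one possible version against the SQLite tables above. The way indexOrder is calculated is purely an assumption for illustration (the negated frequency of the word within the page, so that more frequent words sort first); all that matters for the design is that some sensible ordering is stored with each word-to-page reference:

    import re
    from collections import Counter

    def incorporate_page(conn, url, html, title=""):
        """Add one fetched page: create its SE_Page record, make sure every
        unique word exists in SE_Word, and write one SE_WordToPageReference
        record per unique word in the page."""
        cur = conn.execute(
            "INSERT INTO SE_Page (url, title, HTMLContent) VALUES (?, ?, ?)",
            (url, title, html),
        )
        page_id = cur.lastrowid
        text = re.sub(r"<[^>]+>", " ", html).lower()      # strip HTML tags
        words = re.findall(r"[a-z0-9']+", text)
        counts = Counter(words)
        first_position = {}
        for position, word in enumerate(words):
            first_position.setdefault(word, position)
        for word, count in counts.items():
            conn.execute(
                "INSERT OR IGNORE INTO SE_Word (wordText) VALUES (?)", (word,)
            )
            word_id = conn.execute(
                "SELECT wordID FROM SE_Word WHERE wordText = ?", (word,)
            ).fetchone()[0]
            conn.execute(
                """INSERT INTO SE_WordToPageReference
                   (wordID, pageID, indexOrder, positionOfFirstReferenceWithinText)
                   VALUES (?, ?, ?, ?)""",
                # Assumed ordering: more occurrences of the word -> smaller
                # indexOrder -> earlier in the search results.
                (word_id, page_id, -count, first_position[word]),
            )
        conn.commit()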
Database Size

The above algorithm shows us that the SE_WordToPageReference table has exactly p * w records, where p is the number of pages processed and w is the average number of unique words within each page. We can therefore make some predictions about how much space the database is going to take up. After processing 10,000,000 different pages we will assume that each page used an average of 100 unique words, and that throughout all of the pages 100,000 unique words were found. Then SE_Word will have 100,000 records, SE_Page will have 10,000,000 records (ignoring the small records associated with unprocessed pages) and SE_WordToPageReference will have 100 * 10,000,000 = 1 billion records. If we assume that each HTML page is about 10K then SE_Page's record length will also be approximately 10K. If the average word size is 10 characters then SE_Word's record length is only 4 bytes (32 bits) + 10 bytes = 14 bytes, while SE_WordToPageReference's record length is exactly 4 * 32 bits = 16 bytes. By multiplying the size of each table's record by its number of records we can estimate the size of each table.
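A quick check of this arithmetic, using the stated assumptions (10 million cached pages of about 10K each, 100 unique words per page, 100,000 distinct words, 14-byte word records and 16-byte reference records), can be scripted as follows:

    pages = 10_000_000
    unique_words_per_page = 100
    distinct_words = 100_000

    se_word_bytes = distinct_words * 14                        # roughly 1.4 MB
    se_page_bytes = pages * 10 * 1024                          # roughly 100 GB of cached HTML
    se_reference_bytes = pages * unique_words_per_page * 16    # 1 billion records * 16 B, roughly 16 GB

    for name, size in [("SE_Word", se_word_bytes),
                       ("SE_Page", se_page_bytes),
                       ("SE_WordToPageReference", se_reference_bytes)]:
        print(f"{name}: {size / 1024 ** 3:.2f} GB")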
These results show that the list of unique words could actually fit on a floppy disk, while the cache of all documents and the reference table could both fit on almost anyone's hard disk. There is nothing difficult about this sort of storage requirement for a search engine designed to efficiently search these 10 million pages.

Performing Searches by Combining Words

The search engine has so far been designed to search only for individual words. However, searching for multiple words can be done with the same data organization by simply combining the results of separate SQL statements. For example, to find the ten best pages containing both the words 'tasmanian' and 'devil', the search engine could simply perform two select statements as follows:
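One way to express this, continuing the same SQLite sketch, is to run the two single-word queries with a limit of 100 candidate rows each (the figure used below) and intersect the results in application code:

    def search_two_words(conn, word_a, word_b, limit=10, candidates=100):
        """Fetch the best `candidates` pages for each word and keep the
        pages that the two result lists have in common."""
        def top_pages(word):
            row = conn.execute(
                "SELECT wordID FROM SE_Word WHERE wordText = ?", (word.lower(),)
            ).fetchone()
            if row is None:
                return []
            return [page_id for (page_id,) in conn.execute(
                """SELECT pageID FROM SE_WordToPageReference
                   WHERE wordID = ? ORDER BY indexOrder LIMIT ?""",
                (row[0], candidates),
            )]

        pages_a = top_pages(word_a)
        common = set(top_pages(word_b))
        # Keep word_a's ranking for the pages present in both lists.
        return [p for p in pages_a if p in common][:limit]

    # results = search_two_words(conn, "tasmanian", "devil")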
This approach considers the 100 best pages containing the word 'tasmanian' and the 100 best pages containing the word 'devil', and then takes the pages in common as the final result. While this search would be very quick and would usually return the same top picks as a complete search, it has not scanned the entire database. If fewer than ten results were returned in this first attempt then a complete search could follow:
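A sketch of this fallback is shown below; it drops the per-word limit and lets the database compute the intersection over the whole reference table, for example with SQL's INTERSECT:

    def full_intersection_search(conn, word_a, word_b, limit=10):
        """Complete (unlimited) search for pages containing both words,
        expressed as an SQL INTERSECT over the whole reference table."""
        word_ids = []
        for word in (word_a, word_b):
            row = conn.execute(
                "SELECT wordID FROM SE_Word WHERE wordText = ?", (word.lower(),)
            ).fetchone()
            if row is None:
                return []    # one word is unknown, so no page can match both
            word_ids.append(row[0])
        return conn.execute(
            """SELECT pageID FROM SE_WordToPageReference WHERE wordID = ?
               INTERSECT
               SELECT pageID FROM SE_WordToPageReference WHERE wordID = ?
               LIMIT ?""",
            (word_ids[0], word_ids[1], limit),
        ).fetchall()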
This would do two complete scans of the entire database. If there were a way in SQL to limit the intersection to the top ten results by indexOrder then the performance could probably be improved significantly, because both of the SQL statements could be terminated as soon as the first ten results were found. The method used by most search engines when searching a sentence of words is to require that all words are found. However, if the intersection query does not find any results, you could design your search engine to next perform a union between the select statements.

Further Observations and Ideas

One of the nice things about our search engine design is that if you search for a word that is not found in any of the pages, the search engine will be able to say almost instantaneously that the word could not be found. This is because the word would not be found in the relatively small SE_Word table, and the search ends there. Similarly, if you searched using the text 'This search containing agxaw will be very fast' then each word would first need to be converted to a wordID, and if 'agxaw' is not in the SE_Word table then the search engine can say straight away that no results could be found.

It is worth pointing out that the search engine has been discussed only in the context of searching HTML documents; however, there is no reason that it should not incorporate any other publicly readable document on the web. The trick is to convert such a document to HTML as part of the page collection process and cache it as an HTML page while maintaining the URL reference to the original document.

So there are some ideas for developing your own fairly advanced search engine. While you might not be aiming to overtake Google's 3 billion pages with your project, you could certainly use some of these ideas to create a search engine that spans your own website, or a group of websites, with very fast searching.
Proceed With Caution
As demonstrated, writing a fast search engine that uses a web crawler is not as colossal a programming task as some believe. However, you have to proceed with care and consideration; running a web crawler might be useful for your project but not so good for other people around you. Some web crawler implementations will overload networks and servers, and this happens especially when people are just starting to write a robot and are not producing enough output to trace what their robot is doing. You need to remain contactable while your web crawler is in operation and make yourself easy to contact. If you are blocking someone's server in some unexpected way and wasting hours of someone's time, and you are not contactable, then the conversation when you are finally reached will be very unpleasant. Making yourself contactable by any administrator who notices your robot acting on their server will prevent most serious problems from occurring.
If you have not tried it before, you will quickly discover that setting your web crawler to scan the entire internet is an unachievable task: documents are being created far, far faster than any crawler can scan them. Given all of these points, if your web crawler does a particular, preferably unique, task very well, and is well designed and professionally operated, then it can provide a valuable service.
Some good background information about Web Crawlers is available at www.robotstxt.org/wc/robots.html.
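That page documents the robots exclusion standard. As one simple courtesy, a crawler can check a site's robots.txt before each fetch; the sketch below uses Python's standard library for this, and the user-agent string is only a placeholder for your own contact details:

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    def allowed_to_fetch(url, user_agent="MyCrawler/0.1 (admin@example.com)"):
        """Return True if the site's robots.txt permits fetching this URL."""
        parts = urlparse(url)
        robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            robots.read()
        except OSError:
            return True    # robots.txt unreadable: assume fetching is allowed
        return robots.can_fetch(user_agent, url)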
Julian Cochran

© DigitalScores 2025