Peer to Peer (P2P) Search Engine

The World Wide Web (WWW) is emerging as a source of online information at a very fast rate. Its content is considerably more diverse and certainly much larger than what is commonly understood; information content on the WWW is growing at a rate of about 200% annually. The sheer volume of information available makes searching for specific information quite a daunting task. Search engines are efficient tools for finding relevant information in the rapidly growing and highly dynamic web, and there are quite a number of them available today. Every search engine consists of three major components: a crawler, an indexed repository and search software. The web crawler fetches web pages (documents) in a recursive manner according to a predefined importance metric for web pages; example metrics are back-link count of the page, forward-link count, location and PageRank. The indexer parses these pages to build an inverted index that is then used by the search software to return relevant documents for a user query.
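
To make the indexer component concrete, here is a minimal sketch of how an inverted index can be built and queried. It assumes naive whitespace tokenisation and AND semantics for multi-term queries; it is illustrative only and not how any particular engine implements indexing.

Code:
from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index

def search(index, query):
    """Return page ids containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Example: two crawled "documents" and a two-term query.
pages = {
    "page1": "peer to peer search engine",
    "page2": "deep web search databases",
}
index = build_inverted_index(pages)
print(search(index, "search engine"))   # {'page1'}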

Even though the search engine is a very useful tool, the technology is young and has quite a number of problems, which become worse as the web grows rapidly. Searching the web today is like dragging a net across the surface of an ocean and therefore missing the information in the deep. The reason for this is simple: basic search methodology and technology have not evolved significantly since the inception of the Internet. The WWW consists of the surface web (the visible part of the web, consisting of static documents) and the deep web (documents that are hidden in searchable databases and generated dynamically on demand). The deep web is currently 400 to 550 times larger than the surface web and is of much higher quality.

Traditional search engines create their index by crawling the surface web. Crawlers can fetch documents that are static and linked from other documents. Dynamic pages cannot be fetched by crawlers and hence cannot be indexed by traditional search engines. Dynamic pages are often generated by scripts that need information such as cookie data, a session id or a query string before they generate the content. The crawler has no way to figure out what information to supply to the different databases that produce dynamic pages, which makes it impossible to fetch those pages. If the spider tries to wander deep into such a site, it can enter a never-ending loop in which each request for a page by the spider is met with a request for information from the server. This leads to poor performance by the spider and a potential crash of the web server due to repeated requests from the spider.
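
This limitation can be seen in a minimal surface-web crawler sketch like the one below: it only follows plain hyperlinks and deliberately skips URLs that carry a query string, because it has no way of knowing which parameters a searchable database expects. The breadth-first strategy and the page limit are assumptions made purely for illustration.

Code:
import re
import urllib.request
from urllib.parse import urljoin, urlparse

def crawl(seed, max_pages=50):
    """Breadth-first crawl that only follows plain hyperlinks.

    Pages behind forms, session ids or query strings (the deep web)
    are never reached, which is the limitation described above.
    """
    queue, seen, fetched = [seed], {seed}, {}
    while queue and len(fetched) < max_pages:
        url = queue.pop(0)
        if urlparse(url).query:      # dynamic URL: the crawler cannot
            continue                 # know what parameters to send
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        fetched[url] = html
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched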

The only way of searching the deep web is by sending queries directly to its searchable databases. But querying different deep web sites one at a time is a time-consuming and laborious process. In our peer-to-peer search engine PtoP, we have automated the process of sending queries to deep web sites. The client-side tool allows the user to submit a query, and peer-to-peer technology is used to propagate that query automatically to a large number of peer sites. The results obtained from the different sites are integrated and presented to the user. The advantages of using PtoP are that it obtains fresh and up-to-date information from the sites, it eliminates the risk of a single point of failure because the network keeps working even if a few peer servers are down, and it can search for a file in the file system given the filename as the keyword (which is not possible for traditional search engines like htdig). Effort has been made to keep the communication load between peer servers as low as possible. Also, PtoP's peer server interface is capable of interacting with any local or external search engine.
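
The query fan-out and result integration can be sketched roughly as follows. The peer addresses, the /search endpoint and the JSON result format are illustrative assumptions, not PtoP's actual peer server interface; the point is only to show how a query is propagated to many peers in parallel and how their answers are merged and de-duplicated before being shown to the user.

Code:
import json
import urllib.request
from urllib.parse import quote
from concurrent.futures import ThreadPoolExecutor

# Hypothetical peer list and endpoint; the real PtoP peer-server
# interface is not specified here, so /search?q=... is an assumption.
PEERS = ["http://peer1.example:8080", "http://peer2.example:8080"]

def query_peer(peer, query, timeout=5):
    """Send the user query to one peer and return its result list."""
    url = f"{peer}/search?q={quote(query)}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)        # assumed: a JSON list of hits
    except Exception:
        return []                         # a dead peer is simply skipped

def p2p_search(query):
    """Fan the query out to all peers in parallel and merge the answers."""
    with ThreadPoolExecutor(max_workers=len(PEERS)) as pool:
        result_lists = pool.map(lambda p: query_peer(p, query), PEERS)
    merged, seen = [], set()
    for hits in result_lists:
        for hit in hits:
            key = hit.get("url") if isinstance(hit, dict) else hit
            if key not in seen:           # de-duplicate across peers
                seen.add(key)
                merged.append(hit)
    return merged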