Tags: Focused Crawling, Prediction of New Outlinks, Probabilistic Prediction, Web Analytics, Web Change Prediction and Web Crawling
Abstract:
The rapid dynamics of the World Wide Web pose a challenge for crawling and indexing web pages. Focused crawlers encounter this challenge on a daily basis in their task of providing businesses with timely and complete information on selected areas of the web. In this work, we introduce prediction models for two metrics that are important in organizing the crawling order of pages: the number of new outlinks and the change rate of a page. The results show that static page features such as content and text length have high predictive value for change rate and new outlinks. Moreover, the consistency in the formation of new outlinks in the provided data results in high-quality predictions using only short-history features.