A Novel Approach on Focused Crawling With Anchor Text

Authors

  • S. Subatra Devi, Assistant Professor, Hindustan College of Arts & Science, Chennai, Tamil Nadu, India

DOI:

https://doi.org/10.51983/ajcst-2018.7.1.1849

Keywords:

Focused Crawler, Hyperlink, Anchor Text, Sibling, World Wide Web

Abstract

This paper presents a focused crawling approach that works with a range of anchor texts. Most search engines use anchor text when searching the web to retrieve relevant pages and answer user queries. A general crawler fetches web pages and must then filter out the unnecessary ones; focused crawling performs this filtering as part of the crawl. A focused crawler sets its own crawl boundary based on links, following those that lead to relevant pages and ignoring irrelevant pages on the web. The paper implements a focused crawling method intended to improve the quality of the search. It uses three learning phases, content-based, link-based and sibling-based, to improve the navigation of the search. With this approach, the crawler moves through the relevant pages efficiently and retrieves more of them. Experiments show that focused crawling with the three learning phases retrieves more relevant pages for different anchor texts.
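The abstract does not give the scoring formulas, so the following is only a minimal sketch of how content-based, link-based and sibling-based signals could be combined in a focused crawler's priority queue. Everything in it is an assumption made for illustration: the term-overlap relevance measure, the weights 0.4/0.4/0.2, the threshold, the names `overlap`, `focused_crawl` and `TOY_WEB`, and the in-memory toy web that stands in for real page fetching. It is not the method evaluated in the paper.

```python
import heapq
from typing import Dict, List, Set, Tuple

# Hypothetical toy web: url -> (page text, [(anchor text, target url), ...]).
Page = Tuple[str, List[Tuple[str, str]]]

TOY_WEB: Dict[str, Page] = {
    "seed": ("focused crawler overview",
             [("anchor text crawler", "a"), ("cooking recipes", "b")]),
    "a": ("focused crawler anchor text methods", []),
    "b": ("pasta recipes", []),
}


def overlap(text: str, topic: Set[str]) -> float:
    """Fraction of topic terms found in the text (assumed relevance measure)."""
    words = set(text.lower().split())
    return len(words & topic) / len(topic) if topic else 0.0


def focused_crawl(web: Dict[str, Page], seed: str, topic_terms: List[str],
                  threshold: float = 0.5, budget: int = 10) -> List[str]:
    """Visit at most `budget` pages, best-scored links first; return relevant URLs."""
    topic = {t.lower() for t in topic_terms}
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    relevant: List[str] = []
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        if url not in web:  # dangling link: nothing to fetch
            continue
        budget -= 1
        text, links = web[url]
        content = overlap(text, topic)  # content-based signal (parent page)
        if content >= threshold:
            relevant.append(url)
        anchor_scores = [overlap(anchor, topic) for anchor, _ in links]
        for i, (_, target) in enumerate(links):
            if target in seen:
                continue
            # Sibling-based signal: mean anchor score of the other links on the page.
            others = anchor_scores[:i] + anchor_scores[i + 1:]
            sibling = sum(others) / len(others) if others else 0.0
            # Assumed weights; anchor_scores[i] is the link-based signal.
            priority = 0.4 * content + 0.4 * anchor_scores[i] + 0.2 * sibling
            seen.add(target)
            heapq.heappush(frontier, (-priority, target))
    return relevant
```

Calling `focused_crawl(TOY_WEB, "seed", ["focused", "crawler"])` on the toy data visits the link whose anchor text matches the topic before the unrelated one, and reports only pages whose content passes the threshold. A real crawler would replace the dictionary lookup with page fetching and link extraction.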

Published

22-01-2018

How to Cite

Devi, S. S. (2018). A Novel Approach on Focused Crawling With Anchor Text. Asian Journal of Computer Science and Technology, 7(1), 7–15. https://doi.org/10.51983/ajcst-2018.7.1.1849