Hadoop and MapReduce


The main objective of this journal article is to build a parallelized vertical search engine on an Apache Hadoop cluster, using seed URLs mined from computer-domain pages of Wikipedia. The web pages are crawled and parsed with the Apache Nutch crawler and stored in Apache HBase. The stored pages then go through linguistic processing, which removes stop words and applies stemming. A ranking algorithm is also used to rank the results. The work focuses on obtaining the most relevant results in a short time using distributed processing.


In search engines, queries are mostly too short or ambiguous, and they produce too many results, including irrelevant data. Users do not browse all of the results because there are so many of them. Mobile phones and other ubiquitous devices have small screens, which further limits how many results can be viewed. To produce relevant search results in less time, the vertical search engine is parallelized using Hadoop and MapReduce, and it is compared with a non-parallelized vertical search engine. The method distributes the database over the Hadoop cluster and performs the search using MapReduce. Khezr and Navimipour (2017) divide MapReduce work into three categories, according to the existing deficiencies and the improvement goals: native variants, Apache variants and third-party extensions. Most of the investigations build on the Hadoop platform and aim to improve data processing and scheduling in Hadoop map and reduce tasks. MapReduce is a promising technology with applications in areas such as telecom, manufacturing, government organizations and medicine (Hashem et al., 2016).
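The idea of searching with map and reduce steps can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not the article's implementation: the `PAGES` dictionary is a hypothetical stand-in for pages stored in the distributed database, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `search`) are assumptions chosen for this example. On a real Hadoop cluster the framework would run the map and reduce functions on different nodes and do the grouping itself.

```python
from collections import defaultdict

# Hypothetical sample data standing in for pages held in the distributed database.
PAGES = {
    "wiki/Hadoop": "hadoop stores data in a distributed file system",
    "wiki/MapReduce": "mapreduce splits a job into map tasks and reduce tasks",
    "wiki/HBase": "hbase is a distributed database that runs on hadoop",
}


def map_phase(url, text, keywords):
    """Emit (url, 1) for every occurrence of a query keyword in the page."""
    for word in text.split():
        if word in keywords:
            yield url, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(url, counts):
    """Sum the per-occurrence counts into one match count per page."""
    return url, sum(counts)


def search(pages, query):
    """Run map, shuffle and reduce in sequence; return pages by match count."""
    keywords = set(query.lower().split())
    emitted = [pair
               for url, text in pages.items()
               for pair in map_phase(url, text, keywords)]
    reduced = [reduce_phase(url, counts)
               for url, counts in shuffle(emitted).items()]
    # Highest match count first; ties are broken alphabetically by URL.
    return sorted(reduced, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(search(PAGES, "distributed hadoop"))
```

Because each call to `map_phase` only looks at one page, the map work can be split across nodes without coordination; only the grouped counts need to travel over the network to the reducers.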


The approach combines the advantages of a vertical search engine, such as finding relevant results, with the computing capacity of parallel processing on a Hadoop cluster with MapReduce programming. The components used are the cluster, the database, the URL database, the crawler, linguistic processing, the inverted index and the ranking algorithm. The cluster is formed with Apache Hadoop to gain the advantages of its distributed file system (HDFS). The database is HBase, a distributed non-relational database that provides fault tolerance (Joshi and Mulay, 2018). Wikipedia is used as the URL database that seeds the crawl, which is intended to make it robust and scalable. Linguistic processing filters out stop words such as pronouns, conjunctions and prepositions, along with punctuation and symbols. Stemming uses the Porter stemmer algorithm within the MapReduce programs. The URLs are then indexed and the pages are ranked using the ranking algorithm.
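The processing steps described above (stop-word removal, stemming, inverted index, ranking) can be illustrated with a small sketch. Several parts are assumptions made for this example only: the stop-word list is a short illustrative one, the `stem` function is a crude suffix stripper and not the Porter stemmer the article uses, ranking is by summed term frequency because the article's ranking algorithm is not specified here, and the sample pages are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Short illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "on", "the", "to"}


def stem(word):
    """Crude suffix stripping; a simplified stand-in, NOT the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


def build_index(pages):
    """Build an inverted index: term -> {url: term frequency}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for term in preprocess(text):
            index[term][url] += 1
    return index


def rank(index, query):
    """Score each page by the summed frequency of the query terms (assumed ranking)."""
    scores = Counter()
    for term in preprocess(query):
        for url, freq in index.get(term, {}).items():
            scores[url] += freq
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample pages, for illustration only.
SAMPLE_PAGES = {
    "wiki/Crawler": "The crawler is crawling pages and indexing the crawled pages",
    "wiki/Index": "An inverted index maps terms to pages",
}

if __name__ == "__main__":
    index = build_index(SAMPLE_PAGES)
    print(rank(index, "crawling pages"))
```

The query goes through the same `preprocess` step as the pages, so "crawling" in a query matches both "crawling" and "crawled" in the text. In a MapReduce setting, `preprocess` would run in the map step and the per-term grouping in `build_index` would be done by the reduce step.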


The performance of the vertical search drops as the number of nodes in the cluster increases, because of communication latency. When the number of nodes and the number of URLs are kept constant and the keywords are varied, the time taken to fetch the results increases exponentially and performance decreases.


The aim of the article is to provide an alternative to general search engines that returns relevant data and reduces the overall time spent in the search process. Parallelization offers good performance benefits but also has many limits. In future, with improved efficiency, the approach could be used to replace the general search engine.


Hashem, I.A.T., Anuar, N.B., Gani, A., Yaqoob, I., Xia, F. and Khan, S.U. 2016. MapReduce: Review and open challenges. Scientometrics, 109(1), pp.389–422.

Joshi, R. and Mulay, P. 2018. Mind Map based Survey of Conventional and Recent Clustering Algorithms: Learning’s for Development of Parallel and Distributed Clustering Algorithms. International Journal of Computer Applications, 181(4), pp.14–21.

Khezr, S.N. and Navimipour, N.J. 2017. MapReduce and Its Applications, Challenges, and Architecture: a Comprehensive Review and Directions for Future Research. Journal of Grid Computing, 15(3), pp.295–321.
