Hadoop and Map Reduce

Rationale

The main objective of this journal article is to build a parallelized vertical search engine on an Apache Hadoop cluster, using seed URLs taken from the computer domain of Wikipedia. The web pages are crawled and parsed with the Apache Nutch crawler and stored in Apache HBase. The stored pages then go through linguistic processing, which removes stop words and performs stemming. A ranking algorithm is used to rank the results. The work focuses on obtaining the most relevant results in a short time through distributed processing.

Methods

Search engine queries are mostly too short or ambiguous, so they produce too many results, many of them irrelevant. Users do not browse the full result list because it is so large. Mobile phones and other ubiquitous devices have small screens, which makes long result lists even less usable.

To produce relevant search results in less time, the vertical search engine is parallelized using Hadoop and map-reduce, and it is compared with a non-parallelized vertical search engine. The method distributes the database over the Hadoop cluster and performs the search with map-reduce. Khezr and Navimipour (2017) divide map-reduce implementations into three categories according to the deficiencies they address and their improvement goals: native variants, Apache variants, and third-party extensions. Most investigations build on the Hadoop platform and aim to improve database processing and the scheduling of Hadoop map and reduce tasks. Map-reduce is a promising technology with applications in telecom, manufacturing, government organizations, medicine, and other sectors (Hashem et al., 2016).
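The map-reduce search step described above can be illustrated with a small single-process sketch. It is not the article's implementation and does not use Hadoop itself: the function names (`map_phase`, `shuffle`, `reduce_phase`, `search`) and the sample pages are hypothetical, chosen only to show how a keyword search splits into a map step per page and a reduce step per keyword.

```python
from collections import defaultdict

def map_phase(doc_id, text, keywords):
    """Map step: emit (keyword, (doc_id, count)) for each query keyword found in one page."""
    words = text.lower().split()
    for keyword in keywords:
        count = words.count(keyword)
        if count:
            yield keyword, (doc_id, count)

def shuffle(pairs):
    """Group mapped values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(keyword, postings):
    """Reduce step: order the pages for one keyword by descending match count."""
    return keyword, sorted(postings, key=lambda p: (-p[1], p[0]))

def search(pages, keywords):
    """Run map, shuffle, and reduce in sequence over an in-memory page collection."""
    keywords = [k.lower() for k in keywords]
    mapped = (
        pair
        for doc_id, text in pages.items()
        for pair in map_phase(doc_id, text, keywords)
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical sample pages, standing in for rows stored in the distributed database.
PAGES = {
    "page1": "hadoop stores data in hdfs and hadoop runs map reduce jobs",
    "page2": "hbase is a database on top of hdfs",
    "page3": "a crawler fetches web pages",
}

results = search(PAGES, ["hadoop", "hdfs"])
print(results)
```

On a real cluster each map call would run on the node that holds the page, which is where the parallel speed-up comes from; here the calls simply run one after another.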

Tools

The system combines the advantage of vertical search engines, namely relevant findings, with the parallel computing capacity of a Hadoop cluster and map-reduce programming. Its components are the cluster, the database, the URL database, the crawler, linguistic processing, the inverted index, and the ranking algorithm. The cluster is formed with Apache Hadoop to gain the benefits of its distributed file system (HDFS). The database is HBase, a distributed non-relational database that provides fault tolerance (Joshi and Mulay, 2018). Wikipedia serves as the URL database that seeds the crawl, which makes the crawl robust and scalable. Linguistic processing filters out stop words such as pronouns, conjunctions, and prepositions, along with punctuation and symbols. The Porter stemmer algorithm is used for stemming within the map-reduce program. The URLs are then indexed, and the pages are ranked using the ranking algorithm.
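The linguistic processing, inverted index, and ranking components can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's code: the stop-word list is a small hypothetical sample, `crude_stem` is a simplified suffix stripper standing in for the Porter stemmer, and the ranking is plain term-frequency summation because the article does not name its ranking algorithm.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "is", "are", "it", "to", "for"}

def crude_stem(word):
    """Strip a few common suffixes. A stand-in for the Porter stemmer, not an implementation of it."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop punctuation and symbols, remove stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def build_inverted_index(pages):
    """Map each stemmed term to a counter of {url: term frequency}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for term in preprocess(text):
            index[term][url] += 1
    return index

def rank(index, query):
    """Score each URL by the summed frequency of the query terms (assumed ranking rule)."""
    scores = Counter()
    for term in preprocess(query):
        for url, freq in index.get(term, {}).items():
            scores[url] += freq
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical pages keyed by URL.
PAGES = {
    "wiki/Hadoop": "Hadoop clusters are running distributed jobs, and the jobs use HDFS.",
    "wiki/HBase": "HBase is a distributed database for storing crawled pages.",
}

index = build_inverted_index(PAGES)
ranked = rank(index, "distributed jobs")
print(ranked)
```

The query goes through the same `preprocess` function as the pages, so a query term and a page term match whenever they reduce to the same stem.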

Findings

The performance of the vertical search engine decreases as the number of nodes in the cluster increases, because of communication latency between the nodes. When the number of nodes and URLs is held constant and the keywords are varied, the time to fetch the results increases exponentially and performance decreases.

Conclusion

The aim of this article is to provide an alternative to general search engines that returns relevant data and reduces the overall time spent in the search process. Parallelization provides a good performance benefit but has many limits. In future, with improved efficiency, it could be used to replace conventional search engines.

References

Hashem, I.A.T., Anuar, N.B., Gani, A., Yaqoob, I., Xia, F. and Khan, S.U. 2016. MapReduce: Review and open challenges. Scientometrics, 109(1), pp.389–422.

Joshi, R. and Mulay, P. 2018. Mind Map based Survey of Conventional and Recent Clustering Algorithms: Learning’s for Development of Parallel and Distributed Clustering Algorithms. International Journal of Computer Applications, 181(4), pp.14–21.

Khezr, S.N. and Navimipour, N.J. 2017. MapReduce and Its Applications, Challenges, and Architecture: a Comprehensive Review and Directions for Future Research. Journal of Grid Computing, 15(3), pp.295–321.
