Analyzing Log Files for Web Intrusion Investigation Using Hadoop
Topic: Hadoop and MapReduce
Research Paper Title: Analyzing Log Files for Web Intrusion Investigation Using Hadoop
The objective of the paper
The main objective of this paper is to analyze the weblog files stored on the web server, using Hadoop and MapReduce, in order to find web intrusions by attackers. Performance is evaluated by measuring execution time against sample log files of different sizes.
Managing a website that receives hundreds of thousands of user requests per day is challenging: the site must remain highly accessible to everyone while being protected from security attacks. After a serious incident, security investigators have to analyze the weblog file stored on the web server (Vernekar & Buchade, 2013).
Weblog files may range in size from terabytes to petabytes. Traditional methods cannot handle datasets of that size, and a weblog file may also contain unstructured, incomplete and noisy data, which further degrades the performance of traditional log analyzers. Investigators therefore treat these weblog files as Big Data and use Hadoop to analyze them for web intrusion detection (Therdphapiyanak & Piromsopa, 2013).
The implementation is done in two steps. Sample log files from an Apache web server are used for the experiments.
The first step is preprocessing, which removes missing values and irrelevant or illegal entries from the weblog files (Latib, Ismail, Yusop, Magalingam, & Azmi, 2018). The data in the log file is then transformed into a format suitable for the next step.
The preprocessing algorithm is implemented in Python. The main fields in the transformed log file are ‘Method’, ‘URI’ and ‘HTTP protocol’. In this research, a new field is added that holds the country code generated by pygeoip from the input IP address. This field is useful for finding the country from which an intrusion originates.
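The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the paper's actual code: it assumes the input follows the standard Apache access log layout, and the names `LOG_PATTERN`, `preprocess`, `to_tsv` and the tab-separated output format are hypothetical. The country lookup is passed in as a function so that the sketch runs without a GeoIP database; in the paper that role is played by pygeoip, and the dictionary used here is made-up sample data.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, Optional

# Assumed layout: host ident user [time] "METHOD URI PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<uri>\S+) (?P<protocol>HTTP/\d\.\d)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def preprocess(
    lines: Iterable[str],
    country_lookup: Callable[[str], Optional[str]],
) -> Iterator[Dict[str, str]]:
    """Yield one cleaned record per well-formed log line."""
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # drop incomplete or malformed entries
        record = match.groupdict()
        # Add the country field; "--" marks an address with no known country.
        record["country"] = country_lookup(record["ip"]) or "--"
        yield record


def to_tsv(record: Dict[str, str]) -> str:
    """Serialize a record as one tab-separated line (hypothetical format)."""
    fields = ("ip", "time", "method", "uri", "protocol", "status", "country")
    return "\t".join(record[name] for name in fields)


# Made-up sample input: one valid request, one noise line, one empty request.
SAMPLE_LINES = [
    '192.0.2.10 - - [10/Oct/2018:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    'this line is noise and will be dropped',
    '192.0.2.11 - - [10/Oct/2018:13:55:40 +0000] "-" 408 -',
]

# Hypothetical stand-in for a GeoIP database lookup.
FAKE_COUNTRIES = {"192.0.2.10": "US"}

rows = [to_tsv(r) for r in preprocess(SAMPLE_LINES, FAKE_COUNTRIES.get)]
for row in rows:
    print(row)
```

Only the first sample line survives, because the other two do not contain a complete request. Keeping the lookup injectable also makes it easy to swap in a real GeoIP-backed function without touching the parsing logic.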
The second step is log analysis, which is performed using the Hadoop framework. HDP (Hortonworks Data Platform) is used to analyze the log file, and the Hortonworks Sandbox is used to run the implementation on a single node.
The preprocessed log file is uploaded to HDFS, and the analysis is done using Hive because the log files are semi-structured. The HiveQL queries are written in query editors, and their execution is monitored using Apache Ambari.
The same set of queries is executed on three sample log files of different sizes. Analysis of the preprocessing output shows that removing unwanted data reduces the size of the log files by more than 50%. The log analysis yields information such as the top IP addresses, URIs and countries of web users. This information is then visualized using Power View in Excel and Hive visualization.
The Power View in Excel shows web users by country. The Hive visualization shows a bar chart of the number of requests from each IP address. The chart highlights the IP addresses with the most requests, and this information can be useful for investigators.
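The per-IP request counts behind such a chart come from HiveQL queries in the paper, which are not reproduced in this summary. As a rough illustration of the kind of aggregation involved, the sketch below counts requests per IP address in the map, shuffle and reduce style of the MapReduce model. It is written in plain Python and runs in a single process; the function names and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[Dict[str, str]]) -> Iterator[Tuple[str, int]]:
    """Emit (ip, 1) for every request record."""
    for record in records:
        yield record["ip"], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values for each key to get the request count per IP."""
    return {key: sum(values) for key, values in grouped.items()}


def top_n(counts: Dict[str, int], n: int) -> List[Tuple[str, int]]:
    """Return the n busiest IPs, highest count first, ties broken by IP."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Made-up sample records; real input would be the preprocessed log file.
sample_records = [
    {"ip": "192.0.2.10"},
    {"ip": "198.51.100.7"},
    {"ip": "192.0.2.10"},
    {"ip": "203.0.113.5"},
    {"ip": "192.0.2.10"},
    {"ip": "198.51.100.7"},
]

counts = reduce_phase(shuffle(map_phase(sample_records)))
busiest = top_n(counts, 2)
print(busiest)
```

The same pattern applies to the other rankings mentioned above (URIs and countries) by changing the key emitted in `map_phase`.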
This research shows that the Hadoop framework can be used to analyze weblogs for web intrusion investigation. The experimental results show that execution time does not increase linearly with the size of the log file. The research concludes that Hadoop performs well in analyzing large log files.
Hingave, H., & Ingle, R. (2015). An approach for MapReduce based log analysis using Hadoop. ICECS ’15 (pp. 115-118). Coimbatore: IEEE.
Latib, M. A., Ismail, S. A., Yusop, O. M., Magalingam, P., & Azmi, A. (2018). Analysing Log Files For Web Intrusion Investigation Using Hadoop. ICSIE ’18 (pp. 12-21). Cairo: ACM.
Therdphapiyanak, J., & Piromsopa, K. (2013). Applying Hadoop for log analysis toward distributed IDS. ICUIMC ’13 (pp. 1-6). New York: ACM.
Vernekar, S. S., & Buchade, A. (2013). MapReduce based log file analysis for system threats and problem identification. IACC ’13 (pp. 831-835). Ghaziabad: IEEE.