Analyzing Log Files for Web Intrusion Investigation Using Hadoop
Topic: Hadoop and MapReduce
Objective of the paper
The main objective of this paper is to analyze the web log files stored on the web server in order to detect web intrusions by attackers, using Hadoop and MapReduce. The performance analysis in this research is based on execution time measured against sample log files of different sizes.
Managing a website that receives hundreds of thousands of user requests per day is very challenging: the website must remain highly accessible to everyone while protecting users from security attacks. When a serious incident occurs, security investigators have to analyze the web log files stored on the web server (Vernekar & Buchade, 2013).
Web log files may range in size from terabytes to petabytes. Analyzing them with traditional methods is not feasible, since those methods cannot handle such large datasets. The log files may also contain unstructured, incomplete and noisy data, which degrades the performance of traditional log analyzers. The investigators therefore treat the web log files as Big Data and use Hadoop to analyze them for web intrusion detection (Therdphapiyanak & Piromsopa, 2013).
The implementation is done in two steps. Sample log files from the Apache web server are used for experimentation.
The first step is preprocessing, which removes missing values and irrelevant or illegal entries from the web log files (Latib, Ismail, Yusop, Magalingam, & Azmi, 2018). The data in the log file is then transformed into a format suitable for the next step.
The preprocessing algorithm is implemented in Python. The main fields in the transformed log file are 'Method', 'URI' and 'HTTP protocol'. In this research, a new field is added that holds the country code, generated by pygeoip from the source IP address. This field is useful for finding the country from which the intrusion originated.
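The preprocessing step described above can be sketched in Python as follows. This is a minimal illustration, not the paper's actual code: the Apache combined log format and the sample lines are assumptions, and the pygeoip country lookup is left as a comment because it requires a GeoIP database file.

```python
import re

# Regex for the Apache combined/common log format (an assumption about
# the sample files; the paper does not give the exact layout).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def preprocess(lines):
    """Drop incomplete or illegal entries and keep only the fields the
    paper names: Method, URI and HTTP protocol (plus the source IP,
    which the country-code lookup needs)."""
    records = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m is None:  # noisy or malformed entry: discard it
            continue
        records.append({
            'ip': m.group('ip'),
            'method': m.group('method'),
            'uri': m.group('uri'),
            'protocol': m.group('protocol'),
            # In the paper a country-code field is added here via
            # pygeoip (e.g. GeoIP(...).country_code_by_addr(ip));
            # omitted in this sketch since it needs a GeoIP.dat file.
        })
    return records

sample = [
    '10.0.0.1 - - [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    'garbled line that should be removed',
]
print(preprocess(sample))
```

Discarding non-matching lines is also consistent with the paper's observation that preprocessing shrinks the log files substantially by removing unwanted data.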
The second step is log analysis, which is performed with the Hadoop framework. The Hortonworks Data Platform (HDP) is used to analyze the log file; to run the implementation on a single node, the Hortonworks Sandbox is used.
The preprocessed log file is uploaded to HDFS, and the analysis is done with Hive, since the log files are semi-structured. The HiveQL queries are created in query editors, and query execution is monitored with Apache Ambari.
The same set of queries is executed on three log file samples of different sizes. Analysis of the preprocessing output shows that removing unwanted data decreases the size of the log files by more than 50%. The analysis reports information such as the top IP addresses, URIs and countries of web users. This information is then visualized using Power View in Excel and Hive visualization.
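The kind of aggregation those HiveQL queries perform (top IP addresses, URIs and countries by request count) can be sketched in Python. The records and field names below are illustrative stand-ins for the transformed log file, not data from the paper; the HiveQL in the comment is likewise an assumed equivalent, not a query quoted from the paper.

```python
from collections import Counter

# Toy preprocessed records standing in for the transformed log file.
records = [
    {'ip': '10.0.0.1', 'uri': '/index.html', 'country': 'US'},
    {'ip': '10.0.0.1', 'uri': '/login',      'country': 'US'},
    {'ip': '10.0.0.2', 'uri': '/index.html', 'country': 'MY'},
]

# Roughly what a HiveQL query of the form
#   SELECT ip, COUNT(*) AS hits FROM weblog GROUP BY ip
#   ORDER BY hits DESC LIMIT 10;
# would compute for each field of interest.
def top(records, field, n=10):
    """Return the n most frequent values of `field` with their counts."""
    return Counter(r[field] for r in records).most_common(n)

print(top(records, 'ip'))       # IP addresses with the most requests
print(top(records, 'uri'))      # most requested URIs
print(top(records, 'country'))  # countries issuing the most requests
```

An IP address that dominates the `top(records, 'ip')` output corresponds to the tallest bar in the Hive visualization described below, which is the lead investigators would follow up on.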
The Power View in Excel shows the distribution of web users by country. The Hive visualization shows a bar chart of the number of requests from each IP address. The chart highlights the IP addresses with the most requests, and this information can be useful for investigators.
This research shows that the Hadoop framework can be used to analyze web logs for web intrusion investigation. The experimental results show that execution time does not increase linearly with the size of the log file, demonstrating that Hadoop performs well in analyzing large log files.
References
Hingave, H., & Ingle, R. (2015). An approach for MapReduce based log analysis using Hadoop. ICECS '15 (pp. 115-118). Coimbatore: IEEE.
Latib, M. A., Ismail, S. A., Yusop, O. M., Magalingam, P., & Azmi, A. (2018). Analysing Log Files For Web Intrusion Investigation Using Hadoop. ICSIE '18 (pp. 12-21). Cairo: ACM.
Therdphapiyanak, J., & Piromsopa, K. (2013). Applying Hadoop for log analysis toward distributed IDS. ICUIMC '13 (pp. 1-6). New York: ACM.
Vernekar, S. S., & Buchade, A. (2013). MapReduce based log file analysis for system threats and problem identification. IACC '13 (pp. 831-835). Ghaziabad: IEEE.