Introduction

This project performs data analysis on a loan dataset. Data analysis is used to find insights and patterns in a dataset. The aim of the analysis is to find the dependency and relationship between loan approval status and the other features of the dataset. The analysis helps to identify the characteristics and behavior of customers so that the organization can form an efficient strategy and increase its business growth. In this project, the data analysis has been done for Zappy Financial Services (ZFS). In the approach section, the dataset is evaluated using descriptive analysis and statistical measures. Exploratory data analysis (EDA) and data visualization are used to find patterns and insights in the loan dataset. The entire project has been completed in Python, using libraries such as pandas, matplotlib, and seaborn. Google Colab is used as the working environment.

Approach

The data analysis was done on the Zappy Financial Services (ZFS) loan dataset for the provided loan approval scenario. The analysis is done to help the organization decide whether a loan should be approved or not, which helps it control the risk of loan approval. It will also support the company's marketing objectives, internal audit, and compliance (Anand et al. 2022). The campaign strategy worked well for the company in the previous year. A DBA of the company stored the previous year’s loan data in a PDF file.

The current new records have been provided in an Excel file, which contains missing values and duplicate values. The task is to perform data analysis on the dataset provided by the company. The pandas library plays a major role in the analysis, and data visualization is used to find patterns. The Matplotlib and Seaborn libraries have been used to draw different types of plots such as pie charts, bar charts, histograms, and count plots. In the first task, the structure of the dataset is observed. In the second task, exploratory data analysis (EDA) is carried out.

The Excel dataset has been imported into the notebook for data analysis using the pandas library (Sheikh et al. 2020), which loads it into a two-dimensional data frame. The dataset has 247 rows and 13 columns and is displayed in tabular form.

The previous year’s dataset has also been imported into the notebook using the pandas library (Fang et al. 2021), which loads it into a two-dimensional data frame. This dataset has 398 rows and 13 columns and is displayed in tabular form.

After the two datasets have been imported into data frames, they need to be concatenated row-wise into a single data frame (Odegua 2020). For the concatenation, the concat method of the pandas library has been used. The analysis is then carried out on the full concatenated dataset.
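The row-wise concatenation described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the two small frames and their column names (loan_id, applicant_income) are hypothetical stand-ins for the Excel and previous-year data.

```python
import pandas as pd

# Hypothetical stand-ins for the current and previous-year records;
# the real files and their exact column names are not shown in the report.
current = pd.DataFrame({"loan_id": [1, 2], "applicant_income": [4200, 3100]})
previous = pd.DataFrame(
    {"loan_id": [3, 4, 5], "applicant_income": [5800, 2500, 7600]}
)

# Stack the two frames row-wise; ignore_index=True rebuilds a clean 0..n-1 index.
combined = pd.concat([current, previous], axis=0, ignore_index=True)

# The Excel file is reported to contain duplicates, so exact repeats are dropped.
combined = combined.drop_duplicates().reset_index(drop=True)

print(combined.shape)
```

Both frames need the same column names for the rows to line up; columns that exist in only one frame would be filled with missing values.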

The data types of the dataset have been checked using the info method. The dataset has 16 columns: 2 float columns, 12 integer columns, and 2 object columns. The float-type columns include co-applicant income, and the object-type columns include Loan status (Madaan et al. 2021). The integer-type columns include Loan_Id, Gender, property_area, credit_history, loan_amount_term, loan_amount, applicant income, self_employed, graduate, dependents, and married. The columns have no missing values. The memory usage is 81.0+ KB.
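A check of this kind might look like the sketch below. The three-column frame is a hypothetical miniature of the loan data, used only to show which pandas calls report data types, missing values, and memory usage.

```python
import pandas as pd

# Hypothetical miniature of the loan data with one column per data type.
df = pd.DataFrame(
    {
        "applicant_income": [4200, 3100, 5800],
        "coapplicant_income": [0.0, 1500.5, 2100.0],
        "loan_status": ["Y", "N", "Y"],
    }
)

# info() prints the column dtypes, non-null counts, and memory usage.
df.info()

# The same facts as values that can be inspected or asserted on.
dtype_counts = df.dtypes.astype(str).value_counts()
missing_per_column = df.isna().sum()

print(dtype_counts)
print(missing_per_column)
```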

The first five rows of the dataset are then displayed. For this, the head method is used, which returns the first five rows by default.
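As a small illustration of the row preview, the sketch below uses a hypothetical ten-row frame. In pandas, head returns the first rows and tail returns the last rows, both defaulting to five.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the loan data.
df = pd.DataFrame({"loan_id": range(1, 11)})

first_five = df.head()  # first five rows by default
last_five = df.tail()   # last five rows by default

print(first_five)
print(last_five)
```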

The next task is to display a summary of the dataset. Statistical measures such as the count, mean, standard deviation, minimum, first quartile, second quartile (median), third quartile, and maximum have been calculated. These measures are essential for understanding the dataset. The describe method is used to calculate the summary.
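The summary statistics listed above are what describe returns for a numeric column. The sketch below runs it on a hypothetical income column; the values are invented for illustration and are not taken from the ZFS data.

```python
import pandas as pd

# Hypothetical income values, including one very large value.
df = pd.DataFrame({"applicant_income": [2500, 3100, 4200, 5800, 60000]})

# describe() returns count, mean, std, min, 25%, 50%, 75%, and max.
summary = df["applicant_income"].describe()

print(summary)
```

The row labelled 50% is the second quartile, that is, the median.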

The next task is data visualization, which is used to find important patterns in the dataset and to help the organization understand its customers (Tejaswini et al. 2020). Various charts such as pie charts, bar charts, and histograms have been plotted. A pie chart shows the proportional distribution among the labels of a specific feature. A bar chart is used for comparing the counts of values. A histogram is used to display the distribution of a feature. Matplotlib and Seaborn are the main libraries used for data visualization.

This visualization shows the number of male and female applicants whose loans were approved. The gender column has two labels, 1 and 2. Label 1 denotes male customers and label 2 denotes female customers. The main aim of this visualization is to show the distribution of approved loans by gender (Ray et al. 2021). From the visualization, approximately 360 male customers and 75 female customers have had their loans approved, so more loans have been approved for male customers than for female customers. A histogram has been plotted for the visualization. The y-axis shows the count for each gender and ranges from 0 to 350. The title has been given as Gender. The histogram is plotted with the hist method of the matplotlib library, and the title is set with the title method. The bins have been specified as 77.
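A plot of this kind could be produced roughly as follows. This is a sketch under assumptions: the column names gender and loan_status and the value Y for an approved loan are taken from the text, while the six rows of data are invented.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows: gender 1 = male, 2 = female; loan_status Y = approved.
df = pd.DataFrame(
    {
        "gender": [1, 1, 2, 1, 2, 1],
        "loan_status": ["Y", "Y", "Y", "N", "N", "Y"],
    }
)

# Keep only approved loans, then count them per gender label.
approved = df[df["loan_status"] == "Y"]
counts = approved["gender"].value_counts().sort_index()

# Histogram of the gender labels among approved loans, as in the text.
plt.hist(approved["gender"], bins=77)
plt.title("Gender")
plt.ylabel("Count")
ax = plt.gca()

print(counts)
```

Because gender has only two labels, a bar chart of counts would show the same information with far fewer bins.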

The distribution of applicant income is depicted in this graph, for which a histogram has been plotted. The main aim of this visualization is to understand how the income of ZFS applicants is distributed. Applicant income ranges from 0 to 60000. The histogram shows that most customers have an income between 0 and 10000, and that the applicant income feature has many outliers.

The histogram is right-skewed (Błaszczyński et al. 2021): the mean applicant income is greater than the median applicant income. The x-axis shows the range of applicant income. The y-axis shows the count of customers and ranges from 0 to 80. The title of this histogram is displayed as applicantincome. The histogram is plotted with the hist method of the matplotlib library, and the title is set with the title method. The bins have been specified as 77.
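The link between skewness and the mean and median can be checked numerically. The sketch below uses hypothetical income values with one large outlier; for such data the mean exceeds the median and the sample skewness is positive, which is what right-skewed means.

```python
import pandas as pd

# Hypothetical incomes: most values are small, one is very large.
income = pd.Series([2500, 3100, 4200, 5800, 60000], name="applicant_income")

mean_income = income.mean()
median_income = income.median()
skewness = income.skew()  # positive for a long right tail

print(mean_income, median_income, skewness)
```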

The income distribution of the self-employed applicants is depicted in this graph, for which a histogram has been plotted. The histogram shows that the largest number of self-employed applicants have an income of around 10000. The x-axis shows the income of the applicants (Ali et al. 2021).

The x-axis, which shows applicant income, extends from 0 to 60000. The y-axis, which represents the number of customers, extends from 0 to 350. The title of this histogram is displayed as Applicant income of the self-employed. The histogram is plotted with the hist method of the matplotlib library, and the title is set with the title method. The bins have been specified as 77.

The income distribution of the applicants who are not self-employed is depicted in this graph, for which a histogram has been plotted. The self-employed column has two labels, 0 and 1. Label 0 denotes non-self-employed applicants and label 1 denotes self-employed applicants.

In the visualization, the y-axis shows the count of non-self-employed applicants. The histogram is plotted with the hist method of the matplotlib library, and the title is set with the title method. The bins have been specified as 77.
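The two income histograms for self-employed and non-self-employed applicants could be drawn along the following lines. The column names self_employed and applicant_income follow the text, while the data values and the second plot title are invented for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows: self_employed 1 = self-employed, 0 = not self-employed.
df = pd.DataFrame(
    {
        "self_employed": [0, 1, 0, 1, 0, 0],
        "applicant_income": [3200, 9800, 4100, 10500, 2900, 5200],
    }
)

# Filter the income column by the self_employed label.
self_emp_income = df.loc[df["self_employed"] == 1, "applicant_income"]
non_self_emp_income = df.loc[df["self_employed"] == 0, "applicant_income"]

# One histogram per group, side by side, with the bin count from the text.
fig, (ax_self, ax_non) = plt.subplots(1, 2, figsize=(10, 4))
ax_self.hist(self_emp_income, bins=77)
ax_self.set_title("Applicant income of the self-employed")
ax_non.hist(non_self_emp_income, bins=77)
ax_non.set_title("Applicant income of the non-self-employed")
for ax in (ax_self, ax_non):
    ax.set_xlabel("Applicant income")
    ax.set_ylabel("Count")
```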

This graph shows the number of graduate and non-graduate applicants. A histogram has been used for the visualization. The graduate column has two labels, 0 and 1. Label 0 denotes graduate applicants and label 1 denotes non-graduate applicants. The histogram is plotted to visualize the distribution of applicants across these two labels.

The histogram shows that there are more graduate applicants than non-graduate applicants. The number of graduate applicants is approximately 350 and the number of non-graduate applicants is approximately 75. The x-axis shows the labels of the graduate and non-graduate applicants, which are 0 and 1.

The y-axis depicts the count of graduate and non-graduate applicants and ranges from 0 to 300. The title has been displayed as Graduate. The histogram is plotted with the hist method of the matplotlib library, and the title is set with the title method. The bins have been specified as 77.

This chart shows the percentage of graduate applicants whose loans were approved. A pie chart has been plotted for this purpose, since a pie chart is mainly used to show proportional distribution (Çığşar and Ünal 2019). The main aim of this chart is to display the split between approved and non-approved loans among graduate applicants. The pie chart shows that loans have been approved for most graduate applicants and not approved for a smaller share of them.

The loans of 72.05% of graduate applicants have been approved, while the loans of 27.95% of graduate applicants have not. There are two labels on the pie chart, Y and N. Y denotes graduate applicants whose loans were approved, and N denotes graduate applicants whose loans were not approved (Patel et al. 2020).

The data has first been filtered to keep only the graduate applicants. The value counts of the loan_status column have then been calculated on the filtered data. The pie chart is plotted with the pie method, to which the value counts of the loan status and the corresponding labels have been passed. The autopct format is set to 1.2f, so percentages are shown with two decimal places. The show method has been used to display the pie chart.
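The filter, count, and pie steps described above can be sketched as follows. The column names graduate and loan_status and the label 0 for graduates follow the text; the rows are invented, and the full format string "%1.2f%%" is an assumption about how the 1.2f format was written.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows: graduate 0 = graduate, 1 = non-graduate (as in the text).
df = pd.DataFrame(
    {
        "graduate": [0, 0, 0, 0, 1, 1],
        "loan_status": ["Y", "Y", "Y", "N", "Y", "N"],
    }
)

# Step 1: keep only the graduate applicants.
graduates = df[df["graduate"] == 0]

# Step 2: count approved (Y) and not approved (N) loans among them.
status_counts = graduates["loan_status"].value_counts()
status_share = (status_counts / status_counts.sum() * 100).round(2)

# Step 3: pie chart with percentages shown to two decimal places.
plt.pie(
    status_counts.values,
    labels=status_counts.index.tolist(),
    autopct="%1.2f%%",
)
plt.title("Loan status of graduate applicants")
# In a notebook, plt.show() would display the chart at this point.

print(status_share)
```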

Recommendations for future work

The data analysis has been completed on the loan dataset of the ZFS organization. The fundamental analysis has been carried out using statistical measures, exploratory data analysis, and data visualization. The pandas, Seaborn, and matplotlib libraries were used. The work still has the potential to be improved in the future, for example by implementing a machine learning algorithm to train a predictive model.

Machine learning is a field that uses statistical computation to build models. The dataset already has a dependent column, which is loan status. Supervised machine learning algorithms such as logistic regression, multinomial naive Bayes, and a decision tree classifier can therefore be implemented, since supervised learning requires labeled data.

The loan dataset for ZFS also has the label column, so the dataset is suitable for supervised machine learning. To obtain a more advanced model, ensemble learning methods such as a random forest classifier, gradient boosting, AdaBoost, stacking, and bagging can be used to train the model. Such a machine learning model could be used to predict the dependent feature, loan_status. In addition, a clustering algorithm can be implemented to understand the characteristics and behavior of the customers.
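As one possible starting point for the supervised approach suggested above, the sketch below trains a logistic regression classifier with scikit-learn. It is not based on the ZFS data: the features (credit history and income), the rule that generates the labels, and the noise level are all synthetic assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data; not the ZFS dataset.
rng = np.random.default_rng(0)
n = 200
credit_history = rng.integers(0, 2, size=n)
income = rng.normal(5000, 1500, size=n)

# Assumed rule: approval follows credit history, with 10% of labels flipped.
loan_status = credit_history.copy()
flip = rng.random(n) < 0.1
loan_status = np.where(flip, 1 - loan_status, loan_status)

# Feature matrix; income is rescaled to thousands to help the solver converge.
X = np.column_stack([credit_history, income / 1000.0])

X_train, X_test, y_train, y_test = train_test_split(
    X, loan_status, test_size=0.25, random_state=0, stratify=loan_status
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

The same train and test split could be reused to compare a decision tree or a random forest against this baseline.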

Many clustering algorithms, such as agglomerative clustering, divisive clustering, DBSCAN, and K-means, can be implemented. With a clustering algorithm, applicants can be divided into groups based on similar characteristics and behaviors. The organization can then assign an applicant to a particular customer segment. By examining that segment, the organization can more easily decide whether the loan should be approved for the applicant or not.
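The segmentation idea can be illustrated with K-means from scikit-learn. The six applicants below, their two features (income and loan amount), and the choice of two clusters are hypothetical; a real analysis would choose the features and the number of clusters from the data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical applicants: columns are applicant income and loan amount.
X = np.array(
    [
        [2500, 100],
        [2700, 110],
        [2600, 105],
        [9000, 300],
        [9500, 320],
        [9200, 310],
    ],
    dtype=float,
)

# Scale the features so that income does not dominate the distance measure.
X_scaled = StandardScaler().fit_transform(X)

# Group the applicants into two segments.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(labels)
```

The numeric cluster labels are arbitrary; only the grouping of applicants carries meaning.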

Conclusions

In conclusion, an in-depth analysis has been carried out on the loan dataset. In the first section, the data was imported. After that, the structure of the dataset was examined and a statistical summary was produced, which provides the overall descriptive details of the applicants of the organization. Data visualization was then used to find patterns in the dataset.

This will help the organization to decide whether a loan should be approved for an applicant or not. Some business information can be easily interpreted from the visualizations. Loans have been approved for most graduate applicants and not approved for a smaller share of them: the loans of 72.05% of graduate applicants have been approved, while the loans of 27.95% of graduate applicants have not.

The largest number of self-employed applicants have an income of around 10000. These findings will help the ZFS organization to understand its customers and to make the right decisions, which will in turn support the organization’s business growth. Lastly, recommendations for future work on the loan dataset have been stated, giving steps to improve the loan data analysis.

Reference List

Journals

Anand, M., Velu, A. and Whig, P., 2022. Prediction of loan behaviour with machine learning models for secure banking. Journal of Computer Science and Engineering (JCSE), 3(1), pp.1-13.

Sheikh, M.A., Goel, A.K. and Kumar, T., 2020, July. An approach for prediction of loan approval using machine learning algorithm. In 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC) (pp. 490-494). IEEE.

Fang, W., Li, X., Zhou, P., Yan, J., Jiang, D. and Zhou, T., 2021. Deep learning anti-fraud model for internet loan: where we are going. IEEE Access, 9, pp.9777-9784.

Odegua, R., 2020. Predicting bank loan default with extreme gradient boosting. arXiv preprint arXiv:2002.02011.

Madaan, M., Kumar, A., Keshri, C., Jain, R. and Nagrath, P., 2021. Loan default prediction using decision trees and random forest: A comparative study. In IOP Conference Series: Materials Science and Engineering (Vol. 1022, No. 1, p. 012042). IOP Publishing.

Tejaswini, J., Kavya, T.M., Ramya, R.D.N., Triveni, P.S. and Maddumala, V.R., 2020. Accurate loan approval prediction based on machine learning approach. Journal of Engineering Science, 11(4), pp.523-532.

Ray, R., Gallagher, K.P., Kring, W., Pitts, J. and Simmons, B.A., 2021. Geolocated dataset of Chinese overseas development finance. Scientific Data, 8(1), p.241.

Błaszczyński, J., de Almeida Filho, A.T., Matuszyk, A., Szeląg, M. and Słowiński, R., 2021. Auto loan fraud detection using dominance-based rough set approach versus machine learning methods. Expert Systems with Applications, 163, p.113740.

Ali, S.E.A., Rizvi, S.S.H., Fong-Woon, L., Rao, F.A. and Jan, A.A., 2021. Predicting delinquency on Mortgage loans: an exhaustive parametric comparison of machine learning techniques. International Journal of Industrial Engineering and Management, 12(1), p.1.

Patel, B., Patil, H., Hembram, J. and Jaswal, S., 2020, June. Loan default forecasting using data mining. In 2020 International Conference for Emerging Technology (INCET) (pp. 1-4). IEEE.

Kumar, A., Dugyala, R. and Bhattacharya, P., 2021, July. Prediction of Loan Scoring Strategies Using Deep Learning Algorithm for Banking System. In Innovations in Information and Communication Technologies (IICT-2020) Proceedings of International Conference on ICRIHE-2020, Delhi, India: IICT-2020 (pp. 115-121). Cham: Springer International Publishing.

Cheng, D., Niu, Z. and Zhang, L., 2020. Delinquent events prediction in temporal networked-guarantee loans. IEEE Transactions on Neural Networks and Learning Systems.

Hou, W.H., Wang, X.K., Zhang, H.Y., Wang, J.Q. and Li, L., 2020. A novel dynamic ensemble selection classifier for an imbalanced data set: An application for credit risk assessment. Knowledge-Based Systems, 208, p.106462.

Sahoo, K., Samal, A.K., Pramanik, J. and Pani, S.K., 2019. Exploratory data analysis using Python. International Journal of Innovative Technology and Exploring Engineering (IJITEE), 8(12), p.2019.

Çığşar, B. and Ünal, D., 2019. Comparison of data mining classification algorithms determining the default risk. Scientific Programming, 2019.
