Assignment Sample on Implementing Distributed Machine Learning Techniques
1.0 Introduction
Machine learning now has a very wide scope, and cluster analysis is one of its applications. Cluster analysis classifies or reclassifies a dataset into groups of similar data points. It sorts large datasets into categories by identifying shared characteristics and treating the data points that share them as a separate group within the dataset. This organizes the data into a more manageable form. The machine learning used in cluster analysis is unsupervised and works with unlabeled data. It is applied in large-scale data analysis tasks such as search engines, customer segmentation and the identification of cancer cells.
2.0 Background
Clustering analysis is an unsupervised machine learning process with many real-world applications, such as the “retrieval of information in big data sets, data mining and machine learning pipelines, statistical analysis, speech recognition and computer vision process.” It groups similar data points within one cluster and places dissimilar data points in different clusters (LeDell and Poirier, 2020). The primary goal of clustering analysis is to automatically identify patterns and structures in the input dataset in order to categorize a large dataset into smaller, easier-to-handle groups.
In the analysis of clustered data, the vector representation of the input dataset and the identification of its “domain knowledge” help to define the “similarity matrices” used to identify similarity and dissimilarity between different objects (Hu et al. 2020). “Deep metric learning approaches” extend this identification process by applying neural networks, as described in the machine learning literature. One of the major complications of applying neural networks to big data analysis is the scalability of the process and the “memory bottleneck”.
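As a small illustration of the similarity-matrix idea described above, the following sketch builds a pairwise similarity matrix from vector representations. Cosine similarity is assumed here as one common choice; the function names and the toy vectors are hypothetical and do not come from the cited works.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention assumed here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Return the n x n matrix S with S[i][j] = similarity of objects i and j."""
    n = len(vectors)
    return [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


# Toy data (made up for illustration): two nearby objects and one distant one.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
S = similarity_matrix(points)
```

Because the matrix holds n × n entries, storing it in full grows quadratically with the number of objects, which is one concrete form of the “memory bottleneck” on large datasets.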
3.0 Justification of Research
In recent years, the industry has seen a steep increase in the volume of data that must be analyzed. It is therefore critical to implement data categorization strategies that scale to extensive data analysis and can be easily deployed on “distributed computing platforms” as well as different “cloud services” (Zhang et al. 2017). For these reasons, there is strong demand in computer science for a scalable approach to partitioning large cluster datasets, one that can be implemented on the platforms mentioned above to provide the required functionality and keep the user experience as smooth as possible.
3.1 Research Aim
The primary aim of the research is to study the different approaches to implementing machine learning techniques on cluster datasets, classifying the data into categories of manageable size to help organize the vast data stored in those datasets.
3.2 Research Objectives
The primary objectives of the present research are:
- Understand the process of the cluster dataset analysis.
- Study the approaches used in categorizing the cluster dataset.
- Identify the scope of application of machine learning techniques in the cluster datasets.
- Perform a literature review to gain a detailed understanding of the selected topic.
- Provide a comparative analysis of the traditional analysis process and the suggested machine learning-based analysis.
- Develop a project plan for the selected project.
4.0 Literature Review
Bateni et al. 2017 focus on “minimum spanning tree (MST) based clusterings.” The research implemented “clustering techniques” built on spanning-tree algorithms. It provides a basic framework for differentiating cluster analysis techniques while demonstrating the different properties of “affinity and single-linkage clustering algorithms”.
The research explores its ideas in two pairings of algorithm and clustering method: “Boruvka’s algorithm and affinity clustering” and “Kruskal’s algorithm and single-linkage clustering”. In “affinity clustering”, each vertex of the dataset starts as an individual group. In each round, the algorithm identifies the “cheapest edge going out” of each group and joins the groups along those edges, continuing until the entire tree is formed. In “single-linkage clustering”, by contrast, the algorithm takes a “sequential and iterative” approach, repeatedly picking the edge with the least weight that connects two trees in the dataset.
For quality analysis, the research compares the “Single affinity, Average affinity, Complete affinity” and k-Means clustering algorithms. The experiments were run on the “UCI database and use Euclidean distance.”
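The two procedures described above can be sketched on a small weighted graph. This is a minimal single-machine illustration, not the distributed implementation from the cited research: the helper names (`DisjointSet`, `cluster_labels`, `single_linkage`, `affinity_clustering`), the stopping rule "stop once at most k clusters remain" and the toy edge list are all assumptions made for the example.

```python
class DisjointSet:
    """Union-find structure tracking which cluster each vertex belongs to."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same cluster
        self.parent[rb] = ra
        return True


def cluster_labels(ds, n):
    """Relabel cluster roots as 0, 1, 2, ... in order of first appearance."""
    seen = {}
    return [seen.setdefault(ds.find(v), len(seen)) for v in range(n)]


def single_linkage(n, edges, k):
    """Kruskal-style: merge along the lightest edge, one edge at a time."""
    ds, clusters = DisjointSet(n), n
    for w, u, v in sorted(edges):
        if clusters <= k:
            break
        if ds.union(u, v):
            clusters -= 1
    return cluster_labels(ds, n)


def affinity_round(edges, ds):
    """Boruvka-style round: every cluster picks its cheapest outgoing edge,
    then all picked edges are merged together."""
    cheapest = {}
    for w, u, v in edges:
        ru, rv = ds.find(u), ds.find(v)
        if ru == rv:
            continue  # edge lies inside one cluster
        for r in (ru, rv):
            if r not in cheapest or (w, u, v) < cheapest[r]:
                cheapest[r] = (w, u, v)
    return sum(1 for w, u, v in cheapest.values() if ds.union(u, v))


def affinity_clustering(n, edges, k):
    """Repeat rounds until at most k clusters remain (may end with fewer)."""
    ds, clusters = DisjointSet(n), n
    while clusters > k:
        merged = affinity_round(edges, ds)
        if merged == 0:
            break  # graph is disconnected; nothing left to merge
        clusters -= merged
    return cluster_labels(ds, n)


# Toy graph (made up): edges are (weight, u, v); two tight groups, weak links.
edges = [(1.0, 0, 1), (1.5, 1, 2), (5.0, 2, 3),
         (1.2, 3, 4), (1.1, 4, 5), (6.0, 0, 5)]
kruskal_result = single_linkage(6, edges, 2)
boruvka_result = affinity_clustering(6, edges, 2)
```

The difference in shape is the point of the sketch: the Kruskal-style loop performs one merge per step, while each Boruvka-style round performs many merges at once. That makes rounds the natural unit for parallel execution, but it also means a round can overshoot the requested number of clusters.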
Nair et al. 2018 discussed the application of machine learning algorithms to streaming data, performing real-time health status prediction with Spark and an MLlib decision tree model. The research established a streaming data analysis process on Spark, which is an “open-source Big Data processing engine”. “Spark streaming” is an “extension of Spark Application Programming Interface (API)” that can process “live data streams”. Spark’s MLlib provides different machine learning algorithms, such as “classification, clustering, collaborative filtering etc.”
For experimental purposes, the research used the heart disease dataset from the “UCI machine learning repository” to train and test the machine learning algorithms that predict heart disease. The labelled dataset has 14 attributes, which include properties such as “Age sex, blood pressure level, blood sugar level, Active cholesterol, heart rate etc.” The class-label attribute represents the prediction of heart disease on a scale from 1 to 4. The machine learning model used was a decision tree with “varying maxDepth (tree depth) and maxBins (ordered splits) parameter”. The algorithm was chosen with a cloud platform implementation in mind, one able to predict the chances of a person having heart disease. The application was written in “Scala” and was successfully deployed on the “Amazon Elastic Compute Cloud” system. There, the decision tree model predicted results from the healthcare attributes sent to the system and delivered messages to users stating either “Your health status is OK” or “You are requested to consult a Cardiologist immediately”, based on the prediction.
According to Zhang et al. 2017, clustering techniques partition a dataset in order to separate its objects by size, quality or other attributes, based on the similarities between the objects. Traditional clustering algorithms, however, are designed for static data, which makes them harder to apply to data generated by IoT devices because of its volume and dynamic nature. Algorithms that can be used on large datasets include “fuzzy c-means algorithms, incremental affinity propagation schemes”. The research focuses on the “clustering algorithm based on fast finding and searching of density peaks (CFS)” for improved cluster analysis of dynamic IoT data, and proposes an “incremental CFS algorithm based on k-medoids (ICFSKM)” for this purpose. The proposed algorithm uses two basic cluster operations, creation and merging. Experiments on small-scale UCI datasets and on large applications validated the proposed algorithm against “IMMFC and IAPKN” for dynamic analysis of large-scale data.
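To make the roles of the two tree parameters concrete, the following is a minimal pure-Python sketch of a depth-limited decision tree. It is not the Spark MLlib implementation used in the cited research (which was written in Scala): here `max_depth` caps how many splits lie on any path, and `max_bins` caps the number of candidate thresholds tried per feature (at most `max_bins - 1`), using a simplified selection rule. The patient rows, the 0/1 class coding and all function names are hypothetical.

```python
from collections import Counter


def gini(classes):
    """Gini impurity of a non-empty list of class labels."""
    n = len(classes)
    return 1.0 - sum((c / n) ** 2 for c in Counter(classes).values())


def candidate_thresholds(values, max_bins):
    """Pick at most max_bins - 1 split points between sorted distinct values."""
    max_bins = max(2, max_bins)
    distinct = sorted(set(values))
    mids = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if len(mids) <= max_bins - 1:
        return mids
    step = len(mids) / (max_bins - 1)
    return [mids[int(i * step)] for i in range(max_bins - 1)]


def build_tree(rows, classes, max_depth, max_bins, depth=0):
    """Return a leaf (majority class) or a node (feature, threshold, left, right)."""
    majority = Counter(classes).most_common(1)[0][0]
    if depth >= max_depth or gini(classes) == 0.0:
        return majority
    best = None  # (weighted impurity, feature index, threshold)
    for f in range(len(rows[0])):
        for t in candidate_thresholds([r[f] for r in rows], max_bins):
            left = [y for r, y in zip(rows, classes) if r[f] <= t]
            right = [y for r, y in zip(rows, classes) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(classes)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None or best[0] >= gini(classes):
        return majority  # no split improves purity
    _, f, t = best
    lo = [(r, y) for r, y in zip(rows, classes) if r[f] <= t]
    hi = [(r, y) for r, y in zip(rows, classes) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in lo], [y for _, y in lo], max_depth, max_bins, depth + 1),
            build_tree([r for r, _ in hi], [y for _, y in hi], max_depth, max_bins, depth + 1))


def predict(tree, row):
    """Walk from the root to a leaf and return its class."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def health_message(prediction):
    """Map a class to the two messages quoted in the text (0 = healthy is assumed)."""
    if prediction == 0:
        return "Your health status is OK"
    return "You are requested to consult a Cardiologist immediately"


# Hypothetical rows of (age, blood pressure); classes: 0 = healthy, 1 = at risk.
rows = [[35, 118], [42, 125], [50, 130], [58, 150], [63, 160], [67, 155]]
classes = [0, 0, 0, 1, 1, 1]
tree = build_tree(rows, classes, max_depth=3, max_bins=32)
message = health_message(predict(tree, [65, 158]))
```

A small `max_bins` makes training cheaper but can hide the best threshold from the search, while a small `max_depth` limits how finely the tree can separate the classes. This is the trade-off that varying the two parameters explores.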
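The density-peaks idea behind CFS can be sketched as follows. This is only the static step (local density, distance to the nearest denser point, and label propagation), not the incremental ICFSKM algorithm proposed in the cited research. The cutoff `d_c`, the rule of picking centres by the largest product of density and distance, and the toy points are assumptions made for the example.

```python
import math


def density_peaks(points, d_c, n_centers):
    """Cluster points by density peaks; returns (labels, centre indices)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # rho[i]: local density, the number of other points closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit points from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta[i]: distance to the nearest point that comes earlier in `order`.
    delta = [0.0] * n
    nearest_denser = [-1] * n
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(dist[i])  # the densest point gets the largest distance
            continue
        j = min(order[:pos], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_denser[i] = j

    # Centres combine high density with a large distance to any denser point.
    # The densest point always has the largest product, so it is always chosen.
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], i))[:n_centers]

    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Every other point inherits the label of its nearest denser neighbour,
    # which has already been labelled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels, centers


# Toy data (made up): two square blobs, each with one point in the middle.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels, centers = density_peaks(points, d_c=0.9, n_centers=2)
```

The sketch computes every pairwise distance up front, which is the kind of static, whole-dataset step that becomes costly when new data keeps arriving and that an incremental variant aims to avoid repeating.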
4.1 Literature Gap
The literature survey has provided critical insights into the selected project and a significant understanding of the machine learning techniques applied to clustered datasets. The literature suggests different processes and algorithms for implementing these techniques, but it does not provide a comparative analysis of the different methods or of their improvement over traditional cluster analysis. It is also worth mentioning that the datasets used in the reviewed studies are small compared with industrial applications, and the studies do not state what scalability approaches these algorithms would need in order to handle industrial-scale databases in practice.
4.3 Future Scope
From the discussion in the literature gap section, the future scope of machine learning techniques in cluster data analysis lies in applying the proposed algorithms to the actual datasets found in industrial applications (Liaw et al. 2018). For this purpose, the algorithms need to be tested against large bodies of data, such as those of the informal sector or search engine operations, which will help to show how these techniques behave in real-time implementation. The techniques also need to be analyzed in terms of their scalability in an application and their capability to handle vast clustered datasets. Research is also required to provide a comparative analysis of the different techniques, in order to understand the effectiveness of the different algorithms and suggest the appropriate application for each.
5.0 Research Methodology
5.1 Research Approach
A research approach is a strategy and technique that covers everything from basic ideas to specific data collection, processing and presentation procedures. As a result, it depends on the nature of the study topic being discussed (Sultana et al. 2019). Data analysis or reasoning has proven to be an effective way of carrying out comprehensive research in a systematic manner. In this case, based on the research question, the “Qualitative research approach” has been chosen in order to gather and analyze data from secondary sources. The “Deductive research approach” is a further approach that will supply the research with beneficial information, and it is well suited to gathering quality data through secondary data collection. The deductive approach is the one that most researchers associate with scientific enquiry. The researcher looks at what others have done, examines existing ideas about the phenomenon being researched, and then puts those ideas to the test. This approach also helps in finding the gaps in the existing research, and enables the researchers to work efficiently on filling those gaps in the investigation.
5.2 Data Collection Methods
Data collection procedures are essential because the researcher’s methodology and interpretative approach determine how well the data is used and what conclusions may be drawn from it. Secondary data collection will be employed to acquire diverse qualitative data in this focussed investigation (Swearingen et al. 2017). In secondary data collection, researchers gather information from existing research papers, articles, websites and other existing sources. It is an approach through which researchers can quickly gather information on the research topic, and drawing on varied sources also saves time for other research tasks. As technology advances, the proliferation of social media and internet platforms has made massive volumes of diverse secondary data available. As a result, researchers can glean the necessary knowledge on the research issue from secondary sources (Manogaran and Lopez, 2017). In this targeted research, the researcher will gain more understanding of how the machine learning approach is applied to cluster data by collecting information from secondary sources. Secondary data collection will also give the researcher additional information on the approaches employed today, which will be helpful for analyzing different data and providing efficient sources of knowledge and information for the research paper.
5.3 Ethical Considerations
Ethical considerations are one of the most important aspects of any research. Research participants must not be exposed to harm (Manogaran et al. 2018). The integrity of participants in a study ought to be a primary concern, and the subjects’ full consent should be obtained prior to the study. All of the information for this research paper will be gathered using a secondary research strategy. As a result, ethical constraints will be maintained by avoiding any duplication of existing sources of information. Data will be collected and used solely for the purpose of acquiring information. The confidentiality of every source of data and of the research papers will be treated with the highest seriousness. This study will not use any personal information.
6.0 Research Limitations
As a secondary research project, this research will primarily focus on developing the literature review within the scope of the present project. Hence the primary limitation of this research is its dependence on the literature for the presentation of the different algorithms discussed in the project (Lopez et al. 2020). The project will present the data reported in the literature according to its relevance to the current topic, and will therefore carry forward any bias or practices contained in the literature itself. Within the scope of the project, the lack of practical development will limit the understanding of real-time application and of newer, innovative applications of the suggested processes. It is also worth mentioning that the research will draw the required resources from the internet and rely on the results reported in the literature in order to present a comparative analysis of the selected topic. Hence the project will depend entirely on results from the literature and will not present results generated by the researcher.
8.0 Conclusion
This research proposal on the implementation of machine learning techniques in cluster data analysis has presented a clear background to the project, stating the traditional mechanisms of cluster analysis and the implementation of the machine learning techniques. It has justified the choice of project by providing the precise aim and objectives of the topic. It has then presented a brief literature review on the selected topic, along with the gaps in the literature and the future scope of the research. The proposal has also stated the chosen research methodology, along with the data collection process and the ethical considerations of the research. Finally, it has stated the research limitations and provided an appropriate project plan for the development of the project proposal.
Reference List
Journals:
Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M. and Leskovec, J., 2020. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687.
Jäger, M.O., Morooka, E.V., Canova, F.F., Himanen, L. and Foster, A.S., 2018. Machine learning hydrogen adsorption on nanoclusters through structural descriptors. npj Computational Materials, 4(1), pp.1-8.
Law, M.T., Urtasun, R. and Zemel, R.S., 2017, July. Deep spectral clustering learning. In International conference on machine learning (pp. 1985-1994). PMLR.
LeDell, E. and Poirier, S., 2020, July. H2o automl: Scalable automatic machine learning. In Proceedings of the AutoML Workshop at ICML (Vol. 2020).
Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J.E. and Stoica, I., 2018. Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118.
Lopez-Martin, M., Carro, B. and Sanchez-Esguevillas, A., 2020. Application of deep reinforcement learning to intrusion detection for supervised problems. Expert Systems with Applications, 141, p.112963.
Manogaran, G. and Lopez, D., 2017. A survey of big data architectures and machine learning algorithms in healthcare. International Journal of Biomedical Engineering and Technology, 25(2-4), pp.182-211.
Manogaran, G., Shakeel, P.M., Hassanein, A.S., Kumar, P.M. and Babu, G.C., 2018. Machine learning approach-based gamma distribution for brain tumor detection and data sample imbalance analysis. IEEE Access, 7, pp.12-19.
Nair, L.R., Shetty, S.D. and Shetty, S.D., 2018. Applying spark based machine learning model on streaming big data for health status prediction. Computers & Electrical Engineering, 65, pp.393-399.
Sultana, N., Chilamkurti, N., Peng, W. and Alhadad, R., 2019. Survey on SDN based network intrusion detection system using machine learning approaches. Peer-to-Peer Networking and Applications, 12(2), pp.493-501.
Swearingen, T., Drevo, W., Cyphers, B., Cuesta-Infante, A., Ross, A. and Veeramachaneni, K., 2017, December. ATM: A distributed, collaborative, scalable system for automated machine learning. In 2017 IEEE International Conference on Big Data (Big Data) (pp. 151-162). IEEE.
Zhang, H., Stafman, L., Or, A. and Freedman, M.J., 2017, September. Slaq: quality-driven scheduling for distributed machine learning. In Proceedings of the 2017 Symposium on Cloud Computing (pp. 390-404).
Zhang, Q., Zhu, C., Yang, L.T., Chen, Z., Zhao, L. and Li, P., 2017. An incremental CFS algorithm for clustering large data in industrial internet of things. IEEE Transactions on Industrial Informatics, 13(3), pp.1193-1201.
Zhou, L., Pan, S., Wang, J. and Vasilakos, A.V., 2017. Machine learning on big data: Opportunities and challenges. Neurocomputing, 237, pp.350-361.