# Team Research and Development

## Introduction

Data analysis is the process of cleaning, transforming and modelling data so that useful information can be extracted from it. Its main objective is to isolate meaningful information on which important decisions can be based, and many such decisions draw on historical data. Data analysis can be applied to improving a business, predicting the future of a particular stock, analysing weather reports, predicting the presence of a particular disease and much more. The data may be structured or unstructured; either kind can be analysed and useful information extracted from it (Guru, 2020). In data mining this is known as the KDD (Knowledge Discovery in Databases) process. There are different forms of data analysis, some of which are listed below:

• Analysis of text
• Statistical Analysis process
• Diagnostic Analysis process
• Predictive Analysis process
• Prescriptive Analysis process

Each form of analysis works in its own way; this report describes statistical analysis. Statistical analysis answers the question of what has happened, based on past data. It involves collecting data, analysing it, interpreting the results, presenting them to an audience and building a data model from them. It can be carried out on a complete set of data or on a sample (Statistics, 2019). It can be further classified into two types:

• Descriptive Analysis

This type of analysis examines either the complete data or a sample of numerical data. For continuous data it calculates the mean and the deviation; for categorical data it calculates percentages and frequencies (A., 2019).

• Inferential Analysis

This type of analysis works with samples drawn from the complete data; selecting different samples from the same data can lead to different conclusions.

These are the forms that statistical analysis can take; in this work we implement statistical analysis for the chosen dataset (Statistics, 2019).
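
As a minimal sketch of the descriptive measures mentioned above, the following Python snippet computes the mean and standard deviation of a numeric variable and the frequency and percentage of a categorical one. It is illustrative only: the report's own analysis was done in R, the helper names are made up for this example, and the values are just the first few `age` and `famsize` entries shown in the dataset listing in the next section, not the full 395 observations.

```python
from collections import Counter
from statistics import mean, stdev

# First ten `age` values and first four `famsize` values from the dataset
# listing (a small illustrative sample, not the full data).
ages = [18, 17, 15, 15, 16, 16, 16, 17, 15, 15]
famsize = ["GT3", "GT3", "LE3", "GT3"]


def describe_numeric(values):
    """Return the mean and sample standard deviation of a numeric variable."""
    return {"mean": mean(values), "sd": stdev(values)}


def describe_categorical(values):
    """Return the count and percentage of each category."""
    counts = Counter(values)
    total = len(values)
    return {level: {"count": n, "percent": 100 * n / total}
            for level, n in counts.items()}


age_summary = describe_numeric(ages)
famsize_summary = describe_categorical(famsize)

print(age_summary)
print(famsize_summary)
```

The two helpers are kept separate because the choice of summary depends on the type of the variable: a mean is meaningless for `famsize`, and a percentage table is rarely useful for a continuous variable.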

## Description of Dataset

The dataset chosen for analysis is student data comprising 33 columns. It is taken from the Kaggle website: https://www.kaggle.com/uciml/student-alcohol-consumption?select=student-mat.csv

It contains both categorical and continuous data, with 395 observations of 33 variables stored as a mix of integer and character values.

The different columns are listed as follows:

‘data.frame’: 395 obs. of  33 variables:

\$ school    : chr  “GP” “GP” “GP” “GP” …

\$ sex       : chr  “F” “F” “F” “F” …

\$ age       : int  18 17 15 15 16 16 16 17 15 15 …

\$ address   : chr  “U” “U” “U” “U” …

\$ famsize   : chr  “GT3” “GT3” “LE3” “GT3” …

\$ Pstatus   : chr  “A” “T” “T” “T” …

\$ Medu      : int  4 1 1 4 3 4 2 4 3 3 …

\$ Fedu      : int  4 1 1 2 3 3 2 4 2 4 …

\$ Mjob      : chr  “at_home” “at_home” “at_home” “health” …

\$ Fjob      : chr  “teacher” “other” “other” “services” …

\$ reason    : chr  “course” “course” “other” “home” …

\$ guardian  : chr  “mother” “father” “mother” “mother” …

\$ traveltime: int  2 1 1 1 1 1 1 2 1 1 …

\$ studytime : int  2 2 2 3 2 2 2 2 2 2 …

\$ failures  : int  0 0 3 0 0 0 0 0 0 0 …

\$ schoolsup : chr  “yes” “no” “yes” “no” …

\$ famsup    : chr  “no” “yes” “no” “yes” …

\$ paid      : chr  “no” “no” “yes” “yes” …

\$ activities: chr  “no” “no” “no” “yes” …

\$ nursery   : chr  “yes” “no” “yes” “yes” …

\$ higher    : chr  “yes” “yes” “yes” “yes” …

\$ internet  : chr  “no” “yes” “yes” “yes” …

\$ romantic  : chr  “no” “no” “no” “yes” …

\$ famrel    : int  4 5 4 3 4 5 4 4 4 5 …

\$ freetime  : int  3 3 3 2 3 4 4 1 2 5 …

\$ goout     : int  4 3 2 2 2 2 4 4 2 1 …

\$ Dalc      : int  1 1 2 1 1 1 1 1 1 1 …

\$ Walc      : int  1 1 3 1 2 2 1 1 1 1 …

\$ health    : int  3 3 3 5 5 5 3 1 1 5 …

\$ absences  : int  6 4 10 2 4 10 0 6 0 0 …

\$ G1        : int  5 5 7 15 6 15 12 6 16 14 …

\$ G2        : int  6 5 8 14 10 15 12 5 18 15 …

\$ G3        : int  6 6 10 15 10 15 11 6 19 15 …

The listing above is a short description of the dataset, giving the names of the columns, their data types and sample values. Some of the columns hold categorical values and others hold continuous values.
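
To illustrate how the columns split into integer and character types, here is a small Python sketch (the report itself used R). It reads a hypothetical inline excerpt of four columns, filled with the first four observations from the listing, and labels a column `int` when every value parses as an integer and `chr` otherwise. The helper names `is_int` and `column_types` are assumptions made for this example.

```python
import csv
import io

# Inline excerpt: four of the 33 columns, first four observations from the
# listing. This stands in for the real file, which is not read here.
SAMPLE = """school,sex,age,health
GP,F,18,3
GP,F,17,3
GP,F,15,3
GP,F,15,5
"""


def is_int(text):
    """Return True when the text parses as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def column_types(csv_text):
    """Map each column name to 'int' or 'chr', mirroring the listing's labels."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {name: "int" if all(is_int(row[name]) for row in rows) else "chr"
            for name in rows[0]}


types = column_types(SAMPLE)
print(types)
```

Note that an integer type does not by itself make a column continuous: columns such as `health` are stored as integers but take only the values 1 to 5.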

The dataset can be visualised in R as shown below:

Figure 1: Distribution of Age

The visualisation above shows that the students fall into different age groups, ranging from 15 to 23. A histogram is used because age is a numeric variable. The columns store information about each student: age, address, gender, family size, parents' cohabitation status (living together or apart), education of the mother and father, occupation of the mother and father, the reason for choosing the school, the student's guardian, travel time between home and school, weekly study time, number of past failures, educational support provided by the school, educational support provided by the family, extra paid classes, extracurricular activities, whether the student attended nursery school, whether the student is interested in higher education, whether internet access is available at home, whether the student is in a romantic relationship, quality of family relationships, free time after school, how often the student goes out with friends, workday alcohol consumption, weekend alcohol consumption, health status and number of school absences. G1, G2 and G3 are the grades achieved by the student in different periods. Together these thirty-three columns describe the students in the dataset.

## Research Question Formulation

A number of research questions can be formulated from the dataset, for example: how many students drink alcohol at weekends, and which gender do they belong to? Descriptive statistics such as the mean, median, mode and standard deviation can be calculated to help answer such questions. The answers are obtained by framing a hypothesis for the problem statement and then testing it. The dataset describes student behaviour through integer and character values, which raises further questions. Is the student interested in higher education? Did the student attend nursery school? Which age group and gender do the students who are addicted to alcohol belong to? Is the family environment good? Based on these factors we can estimate whether a particular student is addicted to alcohol or not. A hypothesis, either null or alternative, is formulated to answer each question.

## Null and Alternative Hypothesis

A hypothesis is a proposed explanation that can be tested. For a hypothesis to count as a scientific hypothesis, it must be testable by the scientific method. Several types of hypothesis can be employed when testing a new theory. A null hypothesis (H0) states that there is no relationship between two variables; it is also the default position when there is insufficient information to support a scientific hypothesis. Testing is an attempt to disprove the null hypothesis. Whatever the type of hypothesis, testing it leads to some form of result.

A hypothesis is a claim about the value of a population parameter. When there are two competing hypotheses, the statement assumed to be true is called the null hypothesis (notation H0) and the contradictory statement is called the alternative hypothesis (notation Ha) (MiniTab, 2019). The null and alternative hypotheses are stated as follows:

H0: μ = 4.5, Ha: μ > 4.5

H0: μ ≥ 4.5, Ha: μ < 4.5

H0: μ = 4.75, Ha: μ > 4.75

H0: μ = 4.5, Ha: μ > 4.5

The hypotheses above concern the mean for students who are addicted to alcohol and how much they drink at weekends; the standard deviation is calculated as well. In hypothesis testing, sample data is evaluated in order to reach a decision about a claim made for the population. The hypotheses are framed as follows:

1. The null hypothesis, denoted H0, is assumed to be true unless the data give sufficient evidence to reject it. It always contains some form of equality (=, ≤ or ≥).
2. The alternative hypothesis, denoted Ha, is stated with the <, > or ≠ operator.
3. If the null hypothesis is rejected, the evidence supports the alternative hypothesis.
• The null hypothesis framed for the given dataset is: there is no difference in the mean health status between male and female alcoholics.
• The alternative hypothesis framed is: there is a difference in the mean health status between male and female alcoholics.

The dependent variable is health status and the independent variable is gender.
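
The comparison of mean health status between the two genders can be sketched as a two-sample test. The Python snippet below (the report's own analysis was done in R) computes Welch's t statistic and its approximate degrees of freedom. Welch's test is chosen here purely for illustration and is not necessarily the test used in the report; the function name `welch_t` is an assumption, and the two lists are made-up health scores, not values from the dataset.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical health scores (1-5) for two groups; NOT taken from the dataset.
female_health = [3, 3, 5, 1, 4, 2, 5, 3]
male_health = [5, 4, 5, 3, 5, 4, 3, 5]


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


t_stat, df = welch_t(female_health, male_health)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

A p-value would then be read from the t distribution with `df` degrees of freedom, and H0 rejected if it falls below the chosen significance level. In practice a library routine would do the whole test in one call, for example `t.test` in R (which uses the Welch form by default) or `scipy.stats.ttest_ind(a, b, equal_var=False)` in Python.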

A further question is what proportion of the student population has been affected by alcohol habits.

## Data Visualisation

Data visualisation is the visual representation of data in the form of graphs, charts, plots and maps. It plays an important role in analysing data and in revealing trends, patterns and outliers (Cook, 2020). It gives a clear picture of what the data is about and communicates information easily to viewers. The dataset chosen for analysis can be viewed in several forms.

Effective data visualisation is the crucial final step of data analysis; without it, important insights and messages are lost. Import.io, for example, includes visualisation in its web data integration solution, which extracts data from the web and carries it through the analysis process of preparation, integration and consumption, producing charts and graphs from which insights can be gained. A simple visualisation of the distribution of age across the two genders, male and female, is shown below:

Figure 2: Distribution of Age across two genders

Other visualisations can also be produced, as shown below:

Figure 3: Health status across two genders

The dataset includes the health status of the students, which ranges from 1 to 5. In Figure 3 the values for female students fall between 2 and 5, whereas those for male students fall between 3 and 5. The lowest possible value is 1 and the highest is 5. The frequency of each health status value is shown below:

Figure 4: Frequency values of Health status

Counting the health values by frequency gives a different count for each value. Health status 5 has the highest count, at 150. The frequency axis starts at 0 and rises to a different count for each value.
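
A frequency count like the one in Figure 4 can be sketched in a few lines of Python (the report's own plots were produced in R). The values are only the first ten `health` entries from the dataset listing, so the counts are illustrative and do not match the figure.

```python
from collections import Counter

# First ten `health` values from the dataset listing (illustrative sample only).
health = [3, 3, 3, 5, 5, 5, 3, 1, 1, 5]

# Count each status on the 1-5 scale, keeping statuses that never occur as zero.
counts = Counter(health)
frequency = {status: counts.get(status, 0) for status in range(1, 6)}

# Print a simple text bar chart, one row per health status.
for status, n in frequency.items():
    print(f"{status}: {'#' * n} ({n})")
```

Filling in the zero counts keeps every status on the axis, which is what a bar chart of the full data would show as well.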

## Analysis

The type of analysis carried out is statistical analysis, the art of drawing information out of hidden patterns and trends in data. Statistical analysis takes several forms:

• Summarising the data: taking the whole dataset, analysing it and determining whether it yields meaningful insights.
• Calculating key measures of the data: for example, the mean gives the average value of a variable.
• Calculating measures of spread: for example, the standard deviation tells how far the values are spread around the mean.
• Predicting the future from past historical data, which is helpful in retail, banking, the stock market and healthcare.

Such prediction is useful for estimating future values from the dataset (MiniTab, 2019). Testing a hypothesis is another form of analysis, in which the null hypothesis is either rejected or not rejected. For our dataset, the null and alternative hypotheses are stated and the data are used to decide between them. Statistical analysis can be done in five steps:

1. Understand the nature of data.
2. Understand the relation that exists between the data and the population it comes from.
3. Develop a model that summarises the relation between the chosen data and the population.
4. Prove or disprove the validity of the model.
5. Deploy predictive analytics to guide future actions.

These steps can be applied to any type of problem and have been applied to our dataset as well. A number of statistical analysis tools available on the market can perform all of these tasks.

## Conclusion

The forms of analysis described above allow knowledge to be extracted from the given dataset. Different techniques can be applied to explore and analyse the data, and data analysis can be used to obtain statistical inferences from data. Statistical analysis yields meaningful information, and the KDD process can be supported by applying descriptive analyses to the data. Applying the different forms of analysis to different questions gave a deeper knowledge of the dataset, in particular of the health status of the students. In this way, meaningful information was extracted from the student data.

## References

A., R. J., 2019. Statistical Analysis with Missing Data. 1st ed. s.l.:Wiley.

Cook, A., 2020. Kaggle. [Online]
Available at: https://www.kaggle.com/learn/data-visualization

Guru, 2020. guru99. [Online]
Available at: https://www.guru99.com/what-is-data-analysis.html#:~:text=Data%20analysis%20is%20defined%20as,based%20upon%20the%20data%20analysis.
[Accessed 31 12 2020].

MiniTab, 2019. MiniTab. [Online]
Available at: https://support.minitab.com/en-us/minitab/18/help-and-how-to/statistics/basic-statistics/supporting-topics/basics/null-and-alternative-hypotheses/

Statistics, n.d. Statisticshowto. [Online]
Available at: https://www.statisticshowto.com/statistical-analysis/#:~:text=Statistical%20analysis%20is%20the%20science,Summarize%20the%20data.
[Accessed 04 01 2021].
