Module Code : CIS7031 Employment in Wales

 Module Code : CIS7031 Employment in Wales 

WRIT 1

1. Data processing

1.1 Downloaded dataset of Wales total employment valu

Industry2018201720162015201420132012201120102009
Agriculture41100402004320040700427003680036100361003820037700
Production165700165100162500172300173300164200154400158600149800156700
Construction1018009080010270092600970008930091300900009320096600
Retail190600185100198400206700194300200900202800202600202700203700
ICT31500589003440024000357002690027200264002790027800
Finance35500321003100030800324003240031100332002980033800
Real_Estate25200182002270019100222001800018800176001460013500
Professional_Service89000823007200078500648007250059100639006400064100
Public_Adminstration89600860008320085600867008950088100887009310092800
Other_Service81800832007240077200733007550072800724006800064200

Sample code to create a datafram is given below

 Module Code : CIS7031 Employment in Wales

1.2 Check for any null value or outlier.

The data points that are far from the data points are called outliers. Outliers are unusual values in the given dataset which are problematic in statistical analyses which may cause miss significant findings or distort real results.

No definite statistical rules are available for identifying these outliers it depends upon the knowledge of the subject and the understanding of the collected data process.

There are some guide lines for the statistical test for finding the outlier. Outliers are notably different from all the data collected.  There are many methods to find the outliers in the given data set which are categorised into visual and analytical assessments. Sorting the datasheet is one of the simple and effective ways to find out the unusual values.

Boxplots, histograms and scatterplots are also used to find out the outliers. Boxplots display any symbols or asterisk on the graph to indicate the outliers in the datasets. It uses the inter quartile method to find the outliers. It can also be used to find the outliers in the group of the data. Histograms help to emphasize the outlier by isolated bars.

Scatterplots can also used to detect outliers in multivariate variables. A scatterplot along with the regression line is used to find out the points that are followed to the line and the points that are away from the line are outliers. Hypothesis tests can be used to find out the outliers.

Z-scores are also used to find the outliers which are used to quantify the unusualness of an observation when the data follow a normal distribution. It is the number of standard deviation of the each value. When the values are equal to the mean then the Z-score value will be represented by zero.

The outliers can also be found out using the inter quartile range (IQR).  There are no null values found in the data set.

Check for finding outlier is done by calculating the 1st Quartile (Q1) and 3rd Quartile (Q3) and based on the value obtained IQR (Inter Quartile Range) is calculated. Upper bound is set using the formula Q3 + (1.5*IQR) and the lower bound is calculated using the formula Q1 – (1.5*IQR).

Outlier is identified by comparing the given value with the upper bound and the lower bound. It the given value is greater than the upper bound and less than the lower bound the data is set to be an outlier. In the given dataset no outlier are found.

1.3 The final data frame

Industry2018201720162015201420132012201120102009
Agriculture41100402004320040700427003680036100361003820037700
Production165700165100162500172300173300164200154400158600149800156700
Construction1018009080010270092600970008930091300900009320096600
Retail190600185100198400206700194300200900202800202600202700203700
ICT31500589003440024000357002690027200264002790027800
Finance35500321003100030800324003240031100332002980033800
Real_Estate25200182002270019100222001800018800176001460013500
Professional_Service89000823007200078500648007250059100639006400064100
Public_Adminstration89600860008320085600867008950088100887009310092800
Other_Service81800832007240077200733007550072800724006800064200

2. Data analysis

2.1 Highest and lowest workers

Retail industry has employed highest workers and real estate employed the lowest workers over the period of 2009 to 2018. The graph shows depicts the dataset of Wales employment.

 Module Code : CIS7031 Employment in Wales

2.2. Highest and lowest overall growth

Retail industry has the highest growth and the real estate have a lowest growth. The graph shown in 2.1 supports the answer.

2.3 Best and worst performing year

Year 2018 is the best performing year and 2013 is the worst performing year in relation to number of employment.

 Module Code : CIS7031 Employment in Wales

3. Visual analysis

Create a dynamic scatter/bubble plot showing the change of workforce number over the period using

 Module Code : CIS7031 Employment in Wales

4. Correlation

4.1 Average employment number

Average employment number for each industry over the period

IndustryAverage employment number
Agriculture39280
Production162260
Construction94530
Retail198780
ICT32070
Finance32210
Real_Estate18990
Professional_Service71020
Public_Adminstration88330
Other_Service74080

Retail industry is the highest correlated industry while real estate is the lowest correlated industries.

4.2 Year wise correlation for each industry

 2018201720162015201420132012201120102009
Agriculture1         
Production0.9825631        
Construction0.9929460.9823981       
Retail0.996220.9820020.9958121      
ICT0.9889760.9841390.9962730.9940681     
Finance0.9944410.9832120.9953580.9986840.9946281    
Real_Estate0.9854250.9749040.9946250.9929520.9928280.9964051   
Professional_Service0.9891760.9780120.9953470.9958180.9942010.9985760.9993551  
Public_Adminstration0.9852960.9737430.9935730.9909920.9892270.995020.9981150.9977761 
Other_Service0.9861120.9739680.9947850.9914460.9918940.9950550.9973010.9977020.9989221

Yes all the industries are correlated over each year except the retail industry where the correlation value of the retail industry is very low during the year 2018 when compared with other industries. All other industries have only a slight variation in the correlation values over the years.

5. Clustering (k means & hierarchical)

5.1 K Means Clustering

K means is used to find the local maxima using iterative clustering algorithm. In the first step k is taken as 2. Random points are assigned to each point in the cluster. Then the centroid point for the cluster is selected using a different colour. Closest cluster centroids are re-assigned.

Cluster centroids are recomputed for the entire cluster. The cluster is formed by taking k value as 3 and  the cluster is formed between the clusters of the successive points and the steps are repeated until there is now switching of data between two clusters which marks the termination of the k means algorithm.

5.2 Hierarchical cluster

It is one of the most common testing techniques used in Machine Learning. Hierarchical clustering is divided into two types Agglomerative and Divisive. In agglomerative each data point is considered as an individual cluster and merges the similar cluster during each iteration and the clustering is done until one or K clusters are formed.

The Agglomerative algorithm is straight forward it computes the proximity matrix and each point is considered as a cluster, the process of merging the two closest clusters is done and it is updated in the proximity matrix this process is repeat until a single cluster remains.

Computation of the proximity of two clusters is the key operation of this algorithm. Hierarchical clustering can be visualized using a Dendrogram it is a tree-like structure diagram that which records the sequences of merge and splits in the data.

It is the most common used clustering technique while Divisive hierarchical clustering is not commonly used. It is opposite to the Agglomerative clustering. In this method all the points are considered as a single cluster during each iteration the data that are not similar are separated from the cluster.

Such separated points are considered as an individual cluster. The single cluster is divided into n clusters in the Divisive hierarchical clustering. Hierarchical clustering algorithm builds the cluster hierarchy.

In hierarchical cluster consider all the data points in the data set and form its own cluster. Two nearest clusters are merged into one cluster this will continue until a single cluster is framed. In k means cluster we will a mean value for cluster and frame the cluster.

Hierarchical cluster cannot be used for big data while K means cluster is best suited for big data. The time complexity of K means is linear O(n) while hierarchical clustering is O(n2). K means a random choice is made to choose the cluster.

 Module Code : CIS7031 Employment in Wales

6. Discussion

.The employment landscape of Wales was constantly increasing in all the fields constantly during all the years. In the year there is only a slight variation among the industries based on the employments made. Different type of analysis is made on the data.

The data is first collected from the website and data frame is formed. Analysis is made to find if there is any outliers in the data set by calculating the first quartile and the third quartile of the date then the inter mediate quartile is framed upper and lower bound of the data for each industry is set and compared with the data frame to find out the outliers.

A visual analysis is made on the data frame by creating a bubble scatter chart using plotly which shows that all the clusters are close to each other. Then a correlation was created on the data for every product over the year.

By analysing the correlation we can find that all the correlation values are mostly same except one value of the Retail industry during the year 2018.

Then using the same data set K means clustering and hierarchal clustering algorithms are applied and clusters are framed in K means by keeping the k value as 2 in the first iteration and a second iteration is done with the k value 3 where clusters framed during the two iterations are almost same.

Then a hierarchal cluster algorithm is applied for the same data set which takes all the data in the data set and clusters are framed.

The final result show that the employment landscape of the Wales is constantly increasing over the period of time without more deviation in all the industry. Different analysis made on the data set confirms that the employment of the Wales is gradually increasing without any deviation throughout the years for all the 10 industries.

Reference

  1. O. Omolehin, A. O. Enikuomehin, R. G. Jimoh and K. Rauf, (2009) “Profile of conjugate gradient method algorithm on the performance appraisal for a fuzzy system,” African Journal of Mathematics and Computer Science Research,” vol. 2(3), pp. 030-037.
  2. V. Anand Kumar and G. V. Uma, (2009) “Improving Academic Performance of Students by Applying Data Mining Technique,” European Journal of Scientific Research, vol. 34(4).

Azar, A., Sebt, M. V., Ahmadi, P., & Rajaeian, A., (2013) “A Model for Personnel Selection with A Data Mining Approach: A Case Study in A Commercial Bank: Original Research”. SA Journal of Human Resource Management, 11(1), 1-10.

Al-Radaideh, Q. A., & Al Nagi, E., (2012) “Using Data Mining Techniques to Build a Classification Model for Predicting Employees Performance”. International Journal of Advanced Computer Science and Applications, 3(2).

Lakshmi, T. M., Martin, A., Begum, R. M., & Venkatesan, V. P. (2013) “An Analysis on Performance of Decision Tree Algorithms using Student’s Qualitative Data”. International Journal of Modern Education and Computer Science, 5(5), 18.

Priyama, A., Abhijeeta, R. G., Ratheeb, A., & Srivastavab, S. (2013) “Comparative analysis of decision tree classification algorithms”. International Journal of Current Engineering and Technology, 3(2), 334-337.

Ronald E. Shiffler (1988) Maximum Z Scores and Outliers, The American Statistician, 42:1, 79-80,

Leave a Comment