ITECH 2201 Cloud Computing
School of Science, Information Technology & Engineering
Workbook for Week 6 (Big Data)
Please note: Every effort was made to ensure the given web links are accessible. However, if they are broken, please use any appropriate video/article and reference it in your answer.
Part A (4 Marks)
Exercise 1: Data Science (1 mark)
Read the article at http://datascience.berkeley.edu/about/what-is-data-science/ and answer the following:
What is Data Science?
Answer: Researchers define data science in different ways, and the article does not give a single formal definition. In broad terms, data science can be explained as the process of exploring where data originates and how it can be used, together with the right technology, to generate more value. For example, a company can analyse data from its production process to decide which technology to install to increase the capacity of its machinery and equipment.
According to IBM estimation, what is the percent of the data in the world today that has been created in the past two years?
Answer: According to IBM's estimate, approximately 90% of the data in the world today has been created in the past two years.
What is the value of petabyte storage?
Answer: One petabyte is 10^15 bytes (10 to the power of 15). Compared with gigabytes, one petabyte equals one million gigabytes.
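The comparison between petabytes and gigabytes can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, where each step up is a factor of 1000; the function name convert is only an illustration.

```python
# Decimal (SI) storage units expressed in bytes.
BYTES_PER_UNIT = {
    "kilobyte": 10**3,
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
}


def convert(amount, from_unit, to_unit):
    """Convert an amount of storage between two decimal units."""
    return amount * BYTES_PER_UNIT[from_unit] / BYTES_PER_UNIT[to_unit]


# One petabyte expressed in gigabytes.
print(convert(1, "petabyte", "gigabyte"))  # 1000000.0
```

Binary units (for example a pebibyte, 2^50 bytes) give slightly different figures, so the unit system should be stated when quoting storage sizes.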
For each course, both foundation and advanced, you find at http://datascience.berkeley.edu/academics/curriculum/ briefly state (in 2 to 3 lines) what they offer? Base your answer on the given course description as well as on the video. The purpose of this question is to understand the different streams available in Data Science.
Answer: The foundation and advanced courses in data science are divided into between 9 and 15 chapters. The topics include research design and applications of data, as well as the exploration, storage and communication of data. The foundation courses therefore introduce the core ideas of data science and help readers build their knowledge of the field.
Exercise 2: Characteristics of Big Data (2 marks)
Read the following research paper from IEEE Xplore Digital Library
Ali-ud-din Khan, M.; Uddin, M.F.; Gupta, N., “Seven V’s of Big Data understanding Big Data to extract value,” 2014 Zone 1 Conference of the American Society for Engineering Education (ASEE Zone 1), pp. 1-5, 3-5 April 2014
And answer the following questions:
Summarise the motivation of the author (in one paragraph)
Answer: The authors' motivation is to show readers how big data can be used effectively to improve future performance. The paper also sets out a vision of improved healthcare as a product of big data utilisation. In addition, the paper is intended to help students with their future work.
What are the 7 v’s mentioned in the paper? Briefly describe each V in one paragraph.
Answer: The seven V’s presented in the paper are volume, variety, velocity, validity, veracity, value and visibility.
Volume: This refers to the sheer quantity of information being generated. For example, a commonly cited figure is that YouTube users upload around 48 hours of video every minute.
Variety: Big data includes both structured and unstructured information, such as the content produced by web-based social networking.
Velocity: This refers to the speed at which new information arrives.
Validity: The selected data needs to be valid enough to derive accurate results.
Veracity: This indicates how trustworthy and relevant the data is for its readers.
Value: The selected data should add value for the reader.
Visibility: The data should be collected from different sources.
Explore the author’s future work by using the reference in the research paper. Summarise your understanding of how Big Data can improve the healthcare sector in 300 words.
Answer: Big data can help a healthcare organisation when the data is collected from different sources and proper governance is in place for its collection. In the healthcare sector, big data should help to lower the risk of disaster. To store big data, healthcare companies need to install a systematic data warehouse architecture so that all levels of the organisation can function properly. With that in place, healthcare companies can benefit from a big data system.
Exercise 3: Big Data Platform (1 mark)
In order to build a big data platform – one has to acquire, organize and analyse the big data. Go through the following links and answer the questions that follow the links: Check the videos and change the wordings
Please note: You are encouraged to watch all the videos in the series from Oracle.
How to acquire big data for enterprises and how it can be used?
Answer: Big data can be acquired from HTTP systems, from analysis of market needs, and from research on customer tastes and preferences. After collecting the data, the company stores it in the cloud so that it can be used.
How to organize and handle the big data?
Answer: After collecting big data, the company can handle the large volume by filtering, transforming and sorting it. The company can also use Oracle software to organise unstructured information, because the system is designed so that every fact and data item is stored systematically. This helps employees obtain the data at the right time. In this way, the Oracle system meets the requirements of employees, the organisation and customers through systematic handling of big data.
What are the analyses that can be done using big data?
Answer: For the analysis of big data, Oracle provides statistical and advanced analytics for evaluating the collected data. Oracle offers Oracle Exadata and its data warehousing system for analysing the data, which allows employees to access the data easily when carrying out their work.
Part B (4 Marks)
Part B answers should be based on well cited article/videos – name the references used in your answer. For more information read the guidelines as given in Assignment 1.
Exercise 4: Big Data Products (1 mark)
Google is a master at creating data products. Below are few examples from Google. Describe the below products and explain how the large scale data is used effectively in these products.
- Google’s PageRank:
It is the algorithm used by Google Search to rank search engine results. PageRank works by examining the links that point to a page, counting both how many there are and how important the linking pages are, in order to estimate how important the page itself is.
- Google’s Spell Checker:
It is an application that finds misspelled words in a document by analysing its text and suggests the correct words. This helps researchers and writers to frame correct statements and convey accurate meaning to their readers. Spell checkers are found in word processors, email clients and internet search engines.
- Google’s Flu Trends:
It is a web service operated by Google. It provides estimates of influenza activity in more than 25 countries by aggregating Google search queries, and its main purpose is to make accurate predictions about flu activity.
- Google’s Trends:
It is a public website that shows how often particular search terms are entered into Google Search relative to the total search volume, across regions and over time. It gives people a way to explore and share which topics are currently trending.
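Of the products above, PageRank is the easiest to illustrate in code. The sketch below uses the simplified textbook formulation (repeatedly passing each page's score along its outgoing links), not Google's production system. The damping factor of 0.85 is the conventional textbook value, and the four-page graph toy_web is a hypothetical example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank scores for a small link graph.

    links maps each page to the list of pages it links to.
    Every linked page must also appear as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's score equally among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: A, B and C all link to D; D links back to A.
toy_web = {"A": ["D"], "B": ["D"], "C": ["D"], "D": ["A"]}
scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Page D receives the most links, so it ends up with the highest score, and page A benefits from being linked to by the highly ranked D.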
Like Google, Facebook and LinkedIn also use large scale data effectively. How?
The data that is shared on Facebook and LinkedIn is widely used by companies to analyse people's areas of interest, so that they can develop marketing and business plans and send relevant information to their clients. Facebook also helps companies to share product information by posting product images on their Facebook pages. Social sites therefore play a significant role in gathering and analysing big data about customers.
Exercise 5: Big Data Tools (2 marks)
Briefly explain why a traditional relational database (RDBMS) is not effectively used to store big data?
Answer: An RDBMS is not well suited to storing big data because it is inflexible and supports only structured information. Much of the data produced today is unstructured, and an RDBMS is not able to organise such data effectively.
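The inflexibility described above can be seen in a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in for a relational database; the customers table and its rows are hypothetical. A row that carries a field the schema did not declare is rejected.

```python
import sqlite3


def try_insert_with_extra_field():
    """Return the error message raised when a row does not fit the schema."""
    conn = sqlite3.connect(":memory:")
    # The schema is fixed up front: only id and name are allowed.
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Asha')")
    try:
        # The hobbies column was never declared, so this insert is rejected.
        conn.execute(
            "INSERT INTO customers (id, name, hobbies) VALUES (2, 'Ben', 'chess')"
        )
    except sqlite3.OperationalError as error:
        return str(error)
    finally:
        conn.close()
    return None


print(try_insert_with_extra_field())
```

To accept the new field, the table would first have to be altered, which is the kind of up-front schema work that becomes costly when the shape of the data keeps changing.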
What is NoSQL Database?
Answer: A NoSQL database provides a way to store and retrieve data that is modelled in forms other than the tabular relations used in relational databases. It is widely used for big data. The name is also read as "Not only SQL", because some NoSQL systems support SQL-like query languages. NoSQL databases are also used to build large database systems.
Name and briefly describe at least 5 NoSQL Databases
Answer: Five NoSQL databases are as follows:
- MongoDB: MongoDB is a document database that stores data in a JSON-like format and is easy to use. It also offers query facilities.
- Redis: It is one of the fastest data storage systems because it is an open-source, in-memory store, which increases the speed and performance of the system. It is widely used by companies.
- Cassandra: It is a hybrid database system that can store large amounts of information. It is very useful for managing big data.
- CouchDB: This database makes it easier to build web applications. It also handles changing data in a systematic way, so it can help a company keep track of the changing needs of its customers.
- HBase: It is considered one of the most powerful databases, and it distributes data across all the nodes very quickly. It is well suited to large data sets.
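The document model mentioned for MongoDB can be illustrated without installing a database. The sketch below is a plain-Python stand-in, not the MongoDB API: the customers list, the field names and the find helper are all hypothetical. It shows that two records in the same collection can have different fields, and that each record converts to and from JSON.

```python
import json

# Two hypothetical customer documents with different shapes.
customers = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {
        "_id": 2,
        "name": "Ben",
        "hobbies": ["chess", "cycling"],
        "address": {"city": "Ballarat", "postcode": "3350"},
    },
]


def find(collection, **criteria):
    """Return the documents whose top-level fields match all criteria."""
    return [
        doc
        for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# Documents convert to JSON text and back without losing structure.
as_json = json.dumps(customers[1])
restored = json.loads(as_json)

print(find(customers, name="Ben")[0]["address"]["city"])  # Ballarat
```

Because no schema is declared up front, new fields such as hobbies can be added to one document without changing the others.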
What is Map Reduce and how it works?
Answer: Map Reduce is a programming model for processing and generating big data sets. It also distributes the work across many machines. It performs two main functions: the map step sends the work to the various nodes, and the reduce step combines the results produced on each node into the answer to the query.
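The two functions can be sketched with the classic word count example. The code below runs on a single machine and only imitates the model; a real framework would run the map and reduce steps on many nodes. The function names map_phase, shuffle and reduce_phase are illustrative and do not belong to any particular product.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = []
    for document in documents:  # each document could go to a different node
        mapped.extend(map_phase(document))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Each call to map_phase depends only on its own document, and each call to reduce_phase depends only on its own key, which is what allows the work to be spread across nodes.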
Briefly describe some notable Map Reduce products (at least 5)
Answer: Notable Map Reduce products include the following:
- Disco Project
- Apache Hadoop
Amazon’s S3 service lets you store large chunks of data on an online service. List some 5 features for Amazon’s S3 service.
Answer: Features of Amazon’s S3 service include the following:
- Reduced Redundancy Storage
- Simple Interface
Getting concise, valuable information from a sea of data can be challenging. We need statistical analysis tools to deal with Big Data. Name and describe some (at least 3) statistical analysis tools.
Answer: The statistical tools used to analyse big data are described below:
- Mean: The mean is used to identify the overall trend of the data. It gives a quick summary of the data and is simple to calculate.
- Standard deviation: The standard deviation is used to measure the dispersion of data points around the mean.
- Regression: Regression is a model used to study the relationship between a dependent variable and one or more explanatory variables.
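The three measures above can be computed with Python's standard library. The sketch below uses a small made-up data set (ad_spend and sales are hypothetical values), and the regression is an ordinary least squares fit with one explanatory variable, written out by hand so the formula is visible.

```python
import statistics


def simple_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend (explanatory) and sales (dependent).
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

print("mean:", statistics.mean(sales))
print("standard deviation:", statistics.stdev(sales))  # sample standard deviation
slope, intercept = simple_regression(ad_spend, sales)
print("regression line: sales =", slope, "* ad_spend +", intercept)
```

For this data every extra unit of ad_spend adds two units of sales, so the fitted slope is 2 and the intercept is 1.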
Exercise 6: Big Data Application (1 mark)
Name 3 industries that should use Big Data – justify your claim in 250 words for each industry using proper references.
Answer: Three industries that should use Big Data are transportation, retail and wholesale trade, and government.
- Retail and wholesale trade:
Large retailers and wholesalers need to use big data in order to identify changing customer needs and preferences. To do so, companies can use products from vendors such as Microsoft, Cisco and IBM.
- Transportation:
In transportation, private sector companies use big data to protect transport revenue by saving fuel and travelling time. Big data helps transportation companies to deliver products to the right person at the right time.
- Government:
Government bodies especially need to use big data systems to make proper use of resources and to conduct research about people and their changing habits. This guides the government in making effective decisions.
References
Ananthanarayanan, G., Kandula, S., Greenberg, A. G., Stoica, I., Lu, Y., Saha, B., & Harris, E. (2010, October). Reining in the Outliers in Map-Reduce Clusters using Mantri. In OSDI (Vol. 10, No. 1, p. 24).
Murdoch, T. B., & Detsky, A. S. (2013). The inevitable application of big data to health care. JAMA, 309(13), 1351-1352.
Wu, X., Zhu, X., Wu, G. Q., & Ding, W. (2014). Data mining with big data. IEEE Transactions on Knowledge and Data Engineering, 26(1), 97-107.
Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H. A., & Mankovskii, S. (2012). Solving big data challenges for enterprise application performance management. Proceedings of the VLDB Endowment, 5(12), 1724-1735.
Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4), 1165-1188.
Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1), 3.
Afrati, F. N., & Ullman, J. D. (2010, March). Optimizing joins in a map-reduce environment. In Proceedings of the 13th International Conference on Extending Database Technology (pp. 99-110). ACM.