Advanced and integrated understanding of data modelling

Published By: Admin
Last Updated: 17-Oct-23

Big Data Management

Learning Outcome 1: Demonstrate advanced and integrated understanding of data modelling, storage, and retrieval methods and apply knowledge and skills to retrieve information from data storage;

Learning Outcome 2: Apply knowledge and skills to design and complete a project to coordinate and manage large data sets;

Learning Outcome 3: Analyse critically and interpret the knowledge from large data sets;

Learning Outcome 4: Interpret and transmit information and knowledge in the application discipline to specialist and non-specialist audiences;

Learning Outcome 5: Analyse critically and reflect on the issues of privacy and ethics of Big Data.

Suppose you are working for the Australian Government as a "Data Scientist" to tackle COVID-19 or any future pandemic. Google has released a dataset on people's mobility during the pandemic. As a "Data Scientist," you have found some critical information in that dataset, which helped Australia understand COVID-19. Now, you are famous :).

So, Australia's Prime Minister, Anthony Albanese, has hired you for his special Foreign Affairs team. He wants you to compare Australia's pandemic situation with that of any other country. Luckily, you have the dataset from Google and another new dataset on COVID-19 cases, both stored on the government-secured server. Suppose the size of each dataset is 100 petabytes.

Therefore, you have chosen to use Spark to complete the analysis.

In this assignment, you will reuse some information from Assignments 1 and 2.

Tasks of the Assignment:

• Explore two datasets and identify a research question.
• Create Spark distributed DataFrames from these datasets.
• Explore, filter, and analyse the datasets using Spark.
• Based on the analysis, answer the research question.
• You must use Spark for every kind of analysis. You may bring the data into pandas only for visualisation.

1. Introduction
• Provide a brief discussion of the mobility dataset details.
• Provide a brief discussion of the COVID-19 case (cc) dataset details.
• From where did you download the mobility dataset?

2. Data Exploration:
• Discuss the size of the mobility dataset.
• Discuss the size and format of the cc dataset.
• Discuss the format of the mobility dataset.
• Discuss the features (columns) of the mobility dataset.
• Discuss the features (columns) of the cc dataset.

3. Literature Review:
• Find at least two research works on Google Scholar (any preprint or published work) in which the researchers used this mobility dataset. Provide a brief discussion of their research. How did the researchers use this dataset to answer their research question?
• Find at least two research works on Google Scholar (any preprint or published work) in which the researchers used this cc dataset. Provide a brief discussion of their research. How did the researchers use this dataset to answer their research question?

4. Research Question/Selection of the Problem:

• Identify a research question that you can answer after analysing both datasets. The research question must focus on countries, such as Australia and the UK.
• Justify your research question. Why is your research question important for comparing the COVID-19 situation between Australia and other countries?

5. Method (3):
• You are using Spark as you are dealing with big data. By the way, what is Spark?
• Why did you choose Spark over Hadoop MapReduce?

6. Connection Between Datasets:
• How can you connect these two datasets to answer your research question?
• List the steps you have taken to find out the useful subset of the datasets.

7. Data Analysis:
• Provide a detailed analysis with appropriate visualisations to answer the research question.

(Include relevant discussion for each visualisation.)

8. Findings:
• Provide the discussion to answer your research question based on the findings from the analysis.

9. Ethics and Privacy:
• Research Australian law on collecting public data and show the validity of this mobility dataset under that law.
• Research Australian law on collecting public data and show the validity of this cc dataset under that law.

10. Hosting on a Server:

• Create a Spark cluster in Azure and run your analysis code on that cluster. Record a video with any screen-capture software. The recording should show that you are using Azure and that you are running all of your code on the Azure server using Spark. Upload this video to Google Drive and share the link at the end of the report or in a separate file named.

11. Presentation and Viva:

• Students need to present their work and findings. Questions will be asked at the end of the presentation.

12. Writing Style and Report Format:

• The report is clearly written, and sections are connected.
• The report follows the given structure.
• Proper and correct in-text citation is presented in the report.
• The report cannot exceed fifteen pages (the page count includes everything from the table of contents to the references and appendix). Any font of size 12 pt is accepted.