

Learning outcome 1: Provide a broad overview of the general field of big data systems.

Learning outcome 2: Develop specialised knowledge in areas that demonstrate the interaction and synergy between ongoing.

Task

You will be given a dataset and a set of problem statements. Where possible, you are required to implement the solution both in SQL (using either Hive or Impala) and in Spark (using PySpark); wherever you cannot supply both solutions, you will need to carefully explain your reasons.

General instructions

You will follow a typical data analysis process:

1. Load / ingest the data to be analysed

2. Prepare / clean the data

3. Analyse the data

4. Visualise results / generate report

For steps 1, 2 and 3 you will use the virtual machine (and the software installed on it) provided as part of this module. The data for this assignment will be supplied as a MySQL dump, which you will need to download onto the virtual machine and work with from there.
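A minimal sketch of loading such a dump, driven from Python on the VM; the dump file name and database name here (assignment_data.sql, assignment_db) are hypothetical placeholders to be replaced with your own:

```python
import subprocess

# Hypothetical names: substitute the actual dump file and database.
DUMP_FILE = "assignment_data.sql"
DATABASE = "assignment_db"

# Create the target database (mysql -p prompts for the password).
subprocess.run(
    ["mysql", "-u", "root", "-p", "-e",
     f"CREATE DATABASE IF NOT EXISTS {DATABASE};"],
    check=True,
)

# Replay the dump into the new database.
with open(DUMP_FILE) as dump:
    subprocess.run(["mysql", "-u", "root", "-p", DATABASE],
                   stdin=dump, check=True)
```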

The virtual machine has a MySQL server running, into which you will need to load the data. This may require an initial perusal of the dataset to eliminate any glaring issues. Once the dataset is loaded, you will use Sqoop to get the data into Hadoop. Before doing any processing, you may dump equivalent CSV files to import into your PySpark version of the solution.
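As a rough sketch, the Sqoop import for one table might look like the following; the connection string, credentials, table name and HDFS path are all hypothetical and must be adapted to your VM:

```python
import subprocess

# Hypothetical table and connection details, for illustration only.
table = "measurements"
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://localhost/assignment_db",
        "--username", "root",
        "--password", "secret",   # prefer -P or --password-file in practice
        "--table", table,
        "--target-dir", f"/user/hive/{table}",
        "--num-mappers", "1",     # one mapper avoids split-key issues on small tables
    ],
    check=True,
)
```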

For the cleansing, preparation and analysis you will implement the solution twice (where possible): first in SQL, using either Hive or Impala, and then in Spark, using PySpark.
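To illustrate what implementing the solution twice can look like, the sketch below runs the same cleaning-and-aggregation step once as a SQL statement (which would equally run in Hive or Impala) and once with the DataFrame API; the table and column names are invented for the example:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("assignment").getOrCreate()

# Hypothetical CSV dump with invented column names.
df = spark.read.csv("measurements.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("measurements")

# SQL flavour (the same statement would run in Hive or Impala):
sql_result = spark.sql("""
    SELECT station, AVG(temperature) AS avg_temp
    FROM measurements
    WHERE temperature IS NOT NULL
    GROUP BY station
""")

# Equivalent PySpark DataFrame flavour:
df_result = (
    df.filter(F.col("temperature").isNotNull())
      .groupBy("station")
      .agg(F.avg("temperature").alias("avg_temp"))
)
```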

For the visualisation of the results you are free to use any tool that fulfils the requirements. This can be a tool you have learned about, such as Python's matplotlib, SAS or Qlik, or any other free, open-source tool you find suitable.
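For example, a result table exported as CSV in the analysis step can be charted with matplotlib in a few lines; the file and column names below are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical result file exported from the analysis step.
results = pd.read_csv("avg_temp_by_station.csv")

results.plot(kind="bar", x="station", y="avg_temp", legend=False)
plt.ylabel("Average temperature")
plt.title("Average temperature per station")
plt.tight_layout()
plt.savefig("avg_temp_by_station.png")
```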

To get more than a "Satisfactory" mark, you should implement a number of extra features. These include, but are not limited to:

Creation of a single script that executes the entire process, from loading the supplied data to exporting the result data required for visualisation (see the skeleton sketched after this list).

Creation of a single notebook that executes the entire process, from loading the supplied data to exporting the result data required for visualisation. (Creating two separate notebooks, one for loading the data and performing the SQL analysis using either Hive or Impala, and a second for the PySpark part, would also count under this option.)
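As a rough illustration of the single-script option, the outline below simply chains the stages together; every function is a hypothetical placeholder for the steps described earlier in this brief, not a prescribed structure:

```python
"""Hypothetical end-to-end driver: load -> import -> analyse -> export."""

def load_mysql_dump():
    ...  # replay the MySQL dump into the server (see the earlier sketch)

def sqoop_import():
    ...  # pull the tables into HDFS with Sqoop

def run_analysis():
    ...  # Hive/Impala queries and/or PySpark jobs

def export_results():
    ...  # write result CSVs for the visualisation step

if __name__ == "__main__":
    load_mysql_dump()
    sqoop_import()
    run_analysis()
    export_results()
```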