Assignment Task
Copy data and Load Data into Hive Tables
We’ll work with TPC-H data. TPC-H is a standard decision support benchmark for big data analytics. It contains synthetic data and several queries representing a generalized complex analytics workload that answers critical business questions. The data size and distribution can be controlled via arguments while generating the synthetic data. Below is the schema of the benchmark database. The size of the dataset used in this assignment is 1 GB. You can learn more about TPC-H
Dataset size:
| table | number of tuples |
| customer | 150000 |
| lineitem | 6001215 |
| nation | 25 |
| orders | 1500000 |
| part | 200000 |
| partsupp | 800000 |
| region | 5 |
| supplier | 10000 |
Use an app like cyberduck to transfer files from local machine to the cluster. After installing the app, the only step is to establish a connection between local machine and the server. Following figure shows what information you should input. Choose SFTP (SSH …), input Server either turing.hpc.odu.edu or wahab.hpc.odu.edu, your Username and password. Then, click connect. Authentication may needed for connection.
Write and run queries on the dataset. For each question, please only use one query.
- What is the total number of parts offered by each supplier? The query should return the name of the supplier and the total number of parts.
- What is the cost of the most expensive part by any supplier? The query should return the price of that most expensive No need to return the name of the part or the name of the supplier.
- What is the cost of the most expensive part for each supplier? The query should return the name of the supplier and the cost of the most expensive No need to return the name of that part.
- What is the total number of customers per nation? The query should return the name of the nation and the number of unique
- What is number of parts shipped between 8th, 1996 and Nov. 8th, 1996 for each supplier (shipped on Oct. 8th 1996 and Nov. 8th 1996 should also be considered)? The query should return the name of the supplier and the number of parts.
- Find out the list of customers who ordered some parts where the retail price of those parts are smaller than The query should return custkey.
- Who is the customer with highest total order price per nation? The query should return the name of the customer, the name of the nation and total order For each nation, there is only one customer who has the highest total order price.
- A customer is considered a Gold customer if they have orders with total price more than $2,000,000. Customers have orders with total price between $1,000,000 and $2,000,000 are considered Silver customer. Write a single SQL query to compute the number of customers in these two categories
