DATA 540: Databases and Data Retrieval¶
Lab 2: SQL¶
Objectives¶
Run a MySQL database using Docker, and connect using Jupyter.
Write SQL queries of varying complexity on a MySQL database using SELECT. Queries will involve DISTINCT, joins, expressions/functions, GROUP BY, ORDER BY, HAVING, and subqueries.
Getting Started¶
Ensure you have all of the requirements to run MySQL using Docker and connect using Jupyter as needed and explained in Lab 1.
Start the docker server by running
docker-compose up
Connect your Jupyter instance by running the 2 cells below.
%load_ext sql
%sql mysql+mysqldb://tpch:tpch@127.0.0.1/tpch
(MySQLdb._exceptions.OperationalError) (2003, "Can't connect to MySQL server on '127.0.0.1' (111)")
(Background on this error at: http://sqlalche.me/e/13/e3q8)
Connection info needed in SQLAlchemy format, example:
postgresql://username:password@hostname/dbname
or an existing connection: dict_keys([])
Question 1 (20 marks)¶
The sample database is TPC-H which is a standard database used to benchmark database systems. It is designed to be a data warehouse of information on customers, orders, and products.
The queries are generally of increasing complexity with easier queries at the start. Each question is worth 2 marks. Submit a single text file containing all the SQL SELECT statements for the queries. You do not have to provide the output but may if you wish.
Return the nations (n_nationkey and n_name) in region 1. Rename fields to id and name. Note: Do not worry if SQuirreL does not rename fields in displaying table output. (2 marks)
%%sql
SELECT n_nationkey AS id,
n_name AS name
FROM nation
WHERE n_regionkey = 1
Environment variable $DATABASE_URL not set, and no connect string given.
Connection info needed in SQLAlchemy format, example:
postgresql://username:password@hostname/dbname
or an existing connection: dict_keys([])
Return the customer keys (no duplicates) that have placed at least one order after November 25, 2018 and the total order price was between \(265,000** and **\)280,000 inclusive. Sort by customer key ascending. (2 marks)
%%sql
SELECT DISTINCT o_custkey
FROM orders
WHERE o_orderdate > '2018-11-25'
AND o_totalprice >= 265000
AND o_totalprice <= 280000
ORDER BY o_custkey
Environment variable $DATABASE_URL not set, and no connect string given.
Connection info needed in SQLAlchemy format, example:
postgresql://username:password@hostname/dbname
or an existing connection: dict_keys([])
Return the customer key, name, and account balance for customers with nation key less than 2 and account balance less than 0. (2 marks)
%%sql
SELECT c_custkey,
c_name,
c_acctbal
FROM customer
JOIN nation ON c_nationkey=n_nationkey
WHERE n_nationkey < 2
AND c_acctbal < 0
Environment variable $DATABASE_URL not set, and no connect string given.
Connection info needed in SQLAlchemy format, example:
postgresql://username:password@hostname/dbname
or an existing connection: dict_keys([])
List the part name, supplier name, and nation name of the supplier for all parts where the part name contains ivory and the nation is either ‘CANADA’ or ‘FRANCE’. Order ascending by supplier name and show first 8 rows. (2 marks)
%%sql
SELECT p_name,
s_name,
n_name
FROM part
JOIN partsupp ON p_partkey=ps_partkey
JOIN supplier ON ps_suppkey=s_suppkey
JOIN nation ON s_nationkey=n_nationkey
WHERE (n_name = 'CANADA'
OR n_name = 'FRANCE')
AND p_name like '%ivory%'
ORDER BY s_name
LIMIT 8
Environment variable $DATABASE_URL not set, and no connect string given.
Connection info needed in SQLAlchemy format, example:
postgresql://username:password@hostname/dbname
or an existing connection: dict_keys([])
List pairs of suppliers where the suppilers have the same nation and the first 5 digits of their phone numbers are the same. Show the supplier names, phone numbers, and total account balance (total of two balances). Only show each pair once. Hint: Examine string functions in MySQL, specifically LEFT. (2 marks)
%%sql
SELECT S.s_name,
S2.s_name,
S.s_phone,
S2.s_phone,
S.s_acctbal+S2.s_acctbal AS totalBalance
FROM supplier S
JOIN supplier S2 ON S.s_nationkey=S2.s_nationkey
AND S.s_suppkey < S2.s_suppkey
WHERE left(S.s_phone, 6) = left(S2.s_phone, 6)
Environment variable $DATABASE_URL not set, and no connect string given.
Connection info needed in SQLAlchemy format, example:
postgresql://username:password@hostname/dbname
or an existing connection: dict_keys([])
Determine the number of customers, number of unique countries, total customer account balance, and average length of customer address for all customers in any of these three segments: ‘BUILDING’, ‘AUTOMOBILE’, ‘MACHINERY’ and with a nation key less than 10. (2 marks)
%%sql
SELECT count(*) AS numCustomer,
count(DISTINCT c_nationkey) AS numCountries,
sum(c_acctbal) AS totalBalance,
avg(length(c_address)) AS avgLength
FROM customer
WHERE c_mktsegment in ('BUILDING',
'AUTOMOBILE',
'MACHINERY')
AND c_nationkey < 10
Environment variable $DATABASE_URL not set, and no connect string given.
Connection info needed in SQLAlchemy format, example:
postgresql://username:password@hostname/dbname
or an existing connection: dict_keys([])
For each nation name and market segment, display its number of customers, number of orders, and minimum and maximum customer account balances. Make sure to rename fields as in output. Only process customers in a nation that has a A in its name. Only show rows where the number of customers is greater than 10, the number of orders is greater than 180, and the maximum balance is >= 9000. Include customers even if they do not have any orders. (2 marks)
%%sql
SELECT c_mktsegment AS marketSegment,
n_name AS nation,
count(DISTINCT c_custkey) AS numCustomer,
count(DISTINCT o_orderkey) AS numOrders,
max(c_acctbal) AS maxBalance,
min(c_acctbal) AS minBalance
FROM customer
LEFT JOIN orders ON c_custkey=o_custkey
JOIN nation ON c_nationkey=n_nationkey
WHERE n_name like '%A%'
GROUP BY c_mktsegment,
n_name
HAVING numCustomer > 10
AND numOrders > 180
AND maxBalance >= 9000
Environment variable $DATABASE_URL not set, and no connect string given.
Connection info needed in SQLAlchemy format, example:
postgresql://username:password@hostname/dbname
or an existing connection: dict_keys([])
Return the part key and size for all parts never shipped in a lineitem (l_shipdate) during the month of May. The part size must also be between 10 and 12 inclusive. Order the output by size descending then key descending. Hint: You will need MONTH function. (2 marks)
%%sql
SELECT p_partkey,
p_size
FROM part
WHERE p_partkey not in
(SELECT l_partkey
FROM lineitem
WHERE month(l_shipdate) = 5)
AND p_size BETWEEN 10 AND 12
ORDER BY p_size DESC,
p_partkey DESC
Environment variable $DATABASE_URL not set, and no connect string given.
Connection info needed in SQLAlchemy format, example:
postgresql://username:password@hostname/dbname
or an existing connection: dict_keys([])
Return the customer with the largest account balance. (2 marks)
%%sql
SELECT c_name,
c_acctbal
FROM customer
WHERE c_acctbal =
(SELECT max(c_acctbal)
FROM customer)
Environment variable $DATABASE_URL not set, and no connect string given.
Connection info needed in SQLAlchemy format, example:
postgresql://username:password@hostname/dbname
or an existing connection: dict_keys([])
Determine the total amount each product is in a lineitem using the l_quantity field. For example, part key 1 has total lineitem quantity of 628. Now calculate the average of these total quantities for all parts (average is about 767). Use this calculated average to return the part manufacturer and brand (as a single concatenated field) for parts that are ordered 60% more than the average quantity parts are ordered. Order by total quantity descending. Hint: Use CONCAT function and will need several subqueries. Goal is to do this in one SQL query (with subqueries) but part marks if do it as a series of SQL queries. (2 marks)
%%sql
SELECT concat(p_mfgr, ' - ', p_brand) AS manufacturerBrand,
p_partkey,
totalQty
FROM
(SELECT l_partkey,
sum(l_quantity) AS totalQty
FROM lineitem
GROUP BY l_partkey) L
JOIN part ON l_partkey = p_partkey
WHERE totalQty >
(SELECT avg(totalQty)
FROM
(SELECT l_partkey,
sum(l_quantity) AS totalQty
FROM lineitem
GROUP BY l_partkey) L)*1.60
ORDER BY totalQty DESC
Environment variable $DATABASE_URL not set, and no connect string given.
Connection info needed in SQLAlchemy format, example:
postgresql://username:password@hostname/dbname
or an existing connection: dict_keys([])