DATA 550 Mini-Project 1

In lieu of quizzes, DATA 550 has two mini-projects. Each project is done in pairs, and both members of the team will receive the same mark unless the work was not split roughly evenly (as determined by looking at the commits). In the first project, you will select a programming language, either Python or R, and you will be provided with a dataset. In the second project, you will use the other programming language and will be free to select your own dataset.

Project Instructions

Here are the instructions for your project:

  1. Do this mini-project in the same pairs you have selected for your labs. Take note of your group number (from Canvas) and follow all instructions in this section.

  2. Choose a programming language, R or Python. If you choose R, you must use ggplot2 as your plotting framework; if you choose Python, you must use Altair.

  3. For your assigned dataset, you should first do a full and detailed Exploratory Data Analysis (see Lecture 3) in a Jupyter notebook.

Group Number  | Dataset
Groups 1 - 8  | Default of credit card
Groups 9 - 15 | Medical expenses

Requirements for the EDA:

  • Must include all eight steps of the EDA (Describe your dataset, Load the dataset, Explore your dataset, Initial thoughts, Wrangling, Research Questions, Data analysis & visualizations, Summary and conclusions)

  • Must include at least 3 visualizations, and no more than 6

  • Each visualization must adhere to the principles of effective visualizations as discussed in Lectures 4 and 5.

  • Comments on your EDA must be authentic and genuine, ideally in full sentences.
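As a rough illustration of how the load, explore, wrangle, and analyze steps listed above might look in a Python notebook, here is a minimal sketch using pandas. The inline CSV, its column names (`age`, `smoker`, `charges`), and all of its values are hypothetical placeholders, not the assigned dataset; your real file and columns will differ, and an R notebook would follow the same steps with different tools.

```python
import io

import pandas as pd

# Hypothetical stand-in for an assigned dataset. The column names and
# values are made up for illustration only.
RAW_CSV = """age,smoker,charges
19,yes,16000.0
33,no,4400.0
45,no,
52,yes,36000.0
28,no,3800.0
"""

# Load the dataset (in a real notebook this would be a path to your file)
df = pd.read_csv(io.StringIO(RAW_CSV))

# Explore the dataset: size, column types, and missing values per column
n_rows, n_cols = df.shape
missing_per_column = df.isna().sum()
print(df.dtypes)
print(missing_per_column)

# Wrangling: drop rows with a missing value of interest and add a derived column
clean = df.dropna(subset=["charges"]).copy()
clean["age_group"] = pd.cut(
    clean["age"],
    bins=[0, 30, 50, 120],
    labels=["under 30", "30 to 49", "50 and over"],
    right=False,  # intervals are closed on the left: [0, 30), [30, 50), [50, 120)
)

# Data analysis: a summary table that could motivate one of the visualizations
mean_charges_by_smoker = clean.groupby("smoker")["charges"].mean()
print(mean_charges_by_smoker)
```

A summary like `mean_charges_by_smoker` is only a starting point: each one should be accompanied by written comments in your own words and, where it supports a research question, turned into a plot with the framework required for your chosen language.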

  4. Once you have done an EDA, you should come up with two follow-up research questions. (10 marks)

Each research question should fulfill ONE of these two criteria:

A) RQ should be answerable with this dataset but requires additional data processing or wrangling that is outside the scope of this mini-project; OR

B) RQ cannot be answered with this dataset and requires another dataset. If you choose this criterion, make sure to describe what this hypothetical dataset would include (provide the column names at minimum).

Note: you do NOT need to answer the research questions! In this task we will only evaluate your ability to create research questions.

  5. Present and record the results of your EDA as a 5-minute video.

Important: I am not expecting any video editing, fancy equipment, or even a slide presentation. I want to hear the results of the analysis from you in under five minutes (!).

Keep it low-tech: I suggest you and your partner get on a Zoom call, share your screen with the Jupyter notebook, describe the analysis, and record the call. I should hear from both of you in the presentation, but it’s up to you whether you show both the R and Python plots, or just have one notebook.

  6. Edit the contributions.md file in your repository to outline the contributions of each partner in the group.