Class 2C: Jupyter Notebook¶

We will begin at 2:00 PM! Until then, feel free to use the chat to socialize, and enjoy the music!

[Image: programming.jpg]


Photo by Christina Morillo from Pexels
Sept 17, 2021
Firas Moosvi

Lecture Outline¶

1. Announcements (5 mins)

2. Tour of the Jupyter Notebook Interface (15 mins)

3. Introduction to the Project and finding team members (10 min)

4. Markdown Syntax (10 mins)

Announcements¶

  • Lab 1 will be due Saturday September 18, 2021 at 6 PM

  • Lab 2 will be due Saturday September 25, 2021 at 6 PM

  • Learning Log 1 is now released on Gradescope! It is due on Sunday Sept 19, 2021 at 6 PM (It has a grace period if you need it)

  • Everybody should be on Ed Discussion by now!

    • We will be using it for all course-related announcements

  • My office hours will be held for 30 minutes after every class on Monday, Wednesday, and Friday (15:00 - 15:30)

  • Lab sessions will be on Zoom; find your Zoom link on Canvas.

[Image: zoom3.png]

Tour of the Jupyter Notebook Interface¶

[Screenshots of the Jupyter Notebook interface: base.png, notebook1.png through notebook6.png]

Introduction to the Project and finding team members (10 min)¶

Milestone 1 - Find Dataset¶

In this milestone you will be expected to choose a dataset appropriate for the DATA 301 project. The most important task for this milestone is to select an appropriate dataset and find a team to do the project with.

Overall Expectations¶

  • On average, all team members should be contributing to the project equally!

  • Each team member is responsible for their own research question(s), but the data processing, wrangling, and cleaning steps can be shared.

  • Your questions, analysis, and visualizations should make sense and be well-formed; they do not have to be complicated.

  • You should use proper grammar and full sentences. Point form may occur, but should be less than 30% of your written documents.

  • You must use proper English, spelling, and grammar and you should write concisely.

  • There should be a plan in place to deal with any teamwork conflicts and issues.

Task 1. Choose a topic and a dataset and get it approved (50%)¶

  1. YOU MUST HAVE YOUR DATA SET APPROVED BY A TA or the instructor.

    • To get a dataset approved, fill out the questions on Gradescope. The requirements to choose a dataset are below.

  2. Note: Though it may sound easy, it is not trivial to choose an interesting and relevant dataset. There are many, many thousands out there and the tyranny of choice is pretty overwhelming. I suggest you choose an “industry/sector” (health, technology, finance, sports, etc…), then set a 60 minute timer, start searching, and then choose one before the timer expires. You are welcome to post an issue on Ed Discussion if you want advice or approval of a dataset.

  3. Here are the requirements for choosing a dataset:

Permission to use and distribute

  • Look for a Creative Commons license (for example, a 4.0 license such as CC BY 4.0) or Public Domain, and check to make sure you can make the data publicly available

  • Do not use datasets that require authentication, or APIs

Data quality

  • Try to choose datasets that have no more than 5-10% missing values

  • Ensure there are about 5000 or more observations in the dataset, counted as rows × columns (for example, 5 columns and 1000 rows, or 10 columns and 500 rows)

  • Ensure there are at least 5 variables of potential interest in the dataset
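
The data quality requirements above can be checked quickly before you ask for approval. Below is a minimal sketch using only the Python standard library. The inline sample data is made up, and the `summarize_quality` helper and the set of strings treated as missing are illustrative assumptions, not part of the course materials.

```python
import csv
import io

# Made-up sample data standing in for a real CSV file.
SAMPLE_CSV = """\
id,species,height_cm,weight_kg,site
1,oak,310,,north
2,pine,,41.5,south
3,fir,275,38.0,north
4,oak,290,40.2,NA
"""

# Assumption: these strings mark a missing value in this dataset.
MISSING_MARKERS = {"", "NA", "NaN", "null"}


def summarize_quality(csv_file):
    """Count rows, columns, total cells, and the percentage of missing cells."""
    reader = csv.reader(csv_file)
    header = next(reader)  # first line holds the column names
    n_rows = 0
    n_missing = 0
    for row in reader:
        n_rows += 1
        n_missing += sum(1 for value in row if value.strip() in MISSING_MARKERS)
    n_cells = n_rows * len(header)
    return {
        "rows": n_rows,
        "columns": len(header),
        "cells": n_cells,
        "missing_pct": 100 * n_missing / n_cells if n_cells else 0.0,
    }


summary = summarize_quality(io.StringIO(SAMPLE_CSV))
print(summary)
```

For a real dataset, pass an open file instead of the sample string (for example, `open("your_file.csv", newline="")`, where the file name is a placeholder), then compare `cells`, `columns`, and `missing_pct` against the requirements listed above.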

Interesting (to you)

  • Make sure you have some basic interest in the subject matter!

  • There’s nothing worse than doing a 6-week project on a boring dataset (please don’t pick a movies dataset)

  • In the final weeks of the course you will be building a Dashboard with your data so choose wisely!

Add your dataset to the repository

  • If your dataset is a file, and you have permission to redistribute it, you should add it to the data/raw directory
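
As a rough illustration of the step above, the sketch below copies a dataset file into a `data/raw` directory, creating the directory if it does not exist yet. The `add_to_raw` helper, the repository folder name, and the file name are hypothetical; the demonstration runs inside a temporary directory so nothing real is modified.

```python
import shutil
import tempfile
from pathlib import Path


def add_to_raw(dataset_path, repo_root):
    """Copy a dataset file into <repo_root>/data/raw and return the new path."""
    raw_dir = Path(repo_root) / "data" / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)  # create data/raw if missing
    return Path(shutil.copy2(dataset_path, raw_dir))


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    repo_root = Path(tmp) / "project-repo"     # hypothetical repository folder
    downloaded = Path(tmp) / "my_dataset.csv"  # hypothetical downloaded file
    downloaded.write_text("id,value\n1,10\n")

    copied = add_to_raw(downloaded, repo_root)
    copied_relative = copied.relative_to(repo_root).as_posix()
    copy_exists = copied.exists()
    print(copied_relative)
```

After copying, the file still has to be committed and pushed to the project repository using your usual Git workflow.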

  4. Below are some examples of datasets you are welcome to use for your project:

There are literally hundreds of thousands of datasets available, so I will point you to some high-quality sources (keep in mind that I have not personally checked every single dataset):

  5. There is a list of datasets that you are not allowed to choose, either because A) I know them to be popular but not great for this project, B) they are too complicated, C) they are too simple, or D) many students have chosen them in the past and I am now sick of seeing the same analyses :-).

Task 2. Introduce and describe your dataset and topic. (30%)¶

Once you choose your dataset, you will need to describe your dataset, as well as the topic(s) or research questions you are interested in.

If you are doing a group project, you should do this task together and only one response should be submitted. Feel free to personalize it a bit though and add sentences or points about individual members of the team.

The answers to these questions should be placed in the project’s main README.md file (located in the main repository).

  1. Describe your dataset in about 150-200 words

Consider the following questions to guide you in your exploration:

  • Who: Which company/agency/organization provided this data?

  • What: What is in your data?

  • When: When was your data collected (for example, for which years)?

  • Why: What is the purpose of your dataset? Is it for transparency/accountability, public interest, fun, learning, etc.?

  • How: How was your data collected? Was it a human collecting the data? Historical records digitized? Server logs?

Additional Guidance: Your audience is fellow data scientists. You probably will not need more than 150 words to describe your dataset. Not all of the questions above need to be answered; they are there to guide your exploration and to get you thinking about the context of your data. It is also possible you will not know the answers to some of the questions above. That is FINE: data scientists are often faced with the challenge of analyzing data from unknown sources. Do your best, and acknowledge the limitations of your data as well as of your understanding of it. Also, make it clear what you’re speculating about. For example, “I speculate that the {…column_name…} column must be related to {….} because {….}.”

  2. Describe your topic/interest in this dataset - answer in about 150-200 words

Some questions you may wish to consider:

  • What do you hope to do with your analytics project?

  • Why are you interested in this topic or dataset?

  • Do you have any questions you specifically want to explore?

  • Could you imagine building a user-facing Dashboard with this dataset?

    • Note: In the final weeks of the course you will be building a Dashboard with your data so choose wisely!

Task 3: Submission (20%)¶

For each Milestone there will be two submissions on Gradescope:

  1. Submit your progress to date

  2. Each team member will also submit a teamwork reflection on Gradescope.

The purpose of this individual report is to give you an avenue to present your viewpoint on how the project went, how the group worked together, and your role in the group. Each group member must complete this form to report on their own contributions and those of the other group members. That information can, if needed, be used to adjust the final grades of individuals. This report is private between you and the instructors, meaning that none of your classmates will see it.

Team Assignments¶

You should try to reach out to your teammates as soon as possible via Canvas messages, email, text, etc.

Project Teams¶

Markdown Syntax (10 mins)¶

That’s it! See you next week!¶

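from pathlib import Path

# display and HTML live in IPython.display; IPython.core.display is deprecated for this use
from IPython.display import HTML, IFrame, Markdown, display

# Additional styling; should be moved into helpers
# Read rise.css and inject it into the notebook as a <style> block
HTML('<style>{}</style>'.format(Path('rise.css').read_text()))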