Lecture 7 - Functions, Pandas, Data Cleaning, and Preparation
In this lecture we will talk more about:
- Functions in Python
- Pandas DataFrames
Functions in Python
Pandas
This tutorial is adapted from material by Wes McKinney, the creator of pandas.
Data loading
Last time, and in Lab 3, we talked about loading data in Python using pandas.
Let’s now look at how to clean and prepare data with pandas.
Handling Missing Data
| | 0 |
---|---|
0 | False |
1 | False |
2 | True |
3 | False |
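The code cell that produced the table above is not in this copy of the notes. Here is a minimal sketch that yields the same Boolean mask, assuming a small Series of strings with one missing entry. The values and the name `string_data` are illustrative, not taken from the lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: four strings, the third one missing.
string_data = pd.Series(["aardvark", "artichoke", np.nan, "avocado"])

# isna() returns True wherever a value is missing (NaN or None).
missing_mask = string_data.isna()
print(missing_mask)
```

`isnull()` is an alias for `isna()`, and `notna()` gives the inverted mask.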
Filtering Out Missing Data
| | 0 |
---|---|
0 | 1.0 |
1 | NaN |
2 | 3.5 |
3 | NaN |
4 | 7.0 |
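Starting from the Series shown above, the missing entries can be filtered out with `dropna()`. This is a sketch rather than the original cell; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# The Series from the table above.
data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# dropna() keeps only the non-missing values; index labels are preserved.
cleaned = data.dropna()

# Equivalent form using a Boolean mask.
same_result = data[data.notna()]

print(cleaned)
```

Note that the surviving rows keep their original labels (0, 2, 4), so the index is no longer contiguous.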
| | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
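For a DataFrame like the one above, `dropna()` works row by row. The sketch below rebuilds that table and contrasts the default behaviour with `how="all"`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The DataFrame from the table above.
data = pd.DataFrame([[1.0, 6.5, 3.0],
                     [1.0, np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3.0]])

# Default: drop every row that contains at least one NaN.
any_missing_dropped = data.dropna()

# how="all": drop only the rows in which every value is NaN.
all_missing_dropped = data.dropna(how="all")

print(any_missing_dropped)
print(all_missing_dropped)
```

Passing `axis="columns"` applies the same logic to columns instead of rows.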
| | 0 | 1 | 2 |
---|---|---|---|
0 | -0.147671 | NaN | NaN |
1 | 0.808575 | NaN | NaN |
2 | -0.417914 | NaN | 1.339269 |
3 | -1.114100 | NaN | -1.934764 |
4 | -0.369869 | -0.100167 | 0.338057 |
5 | 1.358480 | -1.395844 | -2.302565 |
6 | 0.537828 | 0.146920 | -0.485762 |
| | 0 | 1 | 2 |
---|---|---|---|
4 | -0.369869 | -0.100167 | 0.338057 |
5 | 1.358480 | -1.395844 | -2.302565 |
6 | 0.537828 | 0.146920 | -0.485762 |
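The two tables above look like a random 7×3 frame with some values blanked out, followed by the result of `dropna()`. Here is a sketch of that pattern, assuming normally distributed data and an arbitrary seed, so the numbers will differ from the tables. It also shows `thresh`, which keeps rows with at least a given number of non-missing values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, an assumption for reproducibility
df = pd.DataFrame(rng.standard_normal((7, 3)))

# Blank out the same positions that are NaN in the table above.
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

# Keep only rows without any missing value (rows 4, 5, 6).
complete_rows = df.dropna()

# Keep rows that have at least two non-missing values (rows 2 to 6).
at_least_two = df.dropna(thresh=2)

print(complete_rows)
print(at_least_two)
```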
Filling In Missing Data
| | 0 | 1 | 2 |
---|---|---|---|
0 | -0.147671 | 0.000000 | 0.000000 |
1 | 0.808575 | 0.000000 | 0.000000 |
2 | -0.417914 | 0.000000 | 1.339269 |
3 | -1.114100 | 0.000000 | -1.934764 |
4 | -0.369869 | -0.100167 | 0.338057 |
5 | 1.358480 | -1.395844 | -2.302565 |
6 | 0.537828 | 0.146920 | -0.485762 |
| | 0 | 1 | 2 |
---|---|---|---|
0 | -0.147671 | 0.500000 | 0.000000 |
1 | 0.808575 | 0.500000 | 0.000000 |
2 | -0.417914 | 0.500000 | 1.339269 |
3 | -1.114100 | 0.500000 | -1.934764 |
4 | -0.369869 | -0.100167 | 0.338057 |
5 | 1.358480 | -1.395844 | -2.302565 |
6 | 0.537828 | 0.146920 | -0.485762 |
| | 0 | 1 | 2 |
---|---|---|---|
0 | -0.147671 | 0.000000 | 0.000000 |
1 | 0.808575 | 0.000000 | 0.000000 |
2 | -0.417914 | 0.000000 | 1.339269 |
3 | -1.114100 | 0.000000 | -1.934764 |
4 | -0.369869 | -0.100167 | 0.338057 |
5 | 1.358480 | -1.395844 | -2.302565 |
6 | 0.537828 | 0.146920 | -0.485762 |
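The three tables above are consistent with filling every hole with 0, filling each column with its own value (0.5 for column 1, 0 for column 2), and filling with 0 in place. The original cells are not shown, so the following is a sketch under those assumptions, using an arbitrary seed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed; values differ from the tables
df = pd.DataFrame(rng.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

# One fill value for every missing entry.
filled_zero = df.fillna(0)

# A dict maps column label -> fill value for that column.
filled_per_column = df.fillna({1: 0.5, 2: 0})

# inplace=True modifies the object itself and returns None,
# so we work on a copy here to leave df untouched.
df_inplace = df.copy()
df_inplace.fillna(0, inplace=True)

print(filled_per_column)
```

By default `fillna` returns a new object, which is usually the safer choice because the original data stays available.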
| | 0 | 1 | 2 |
---|---|---|---|
0 | 0.331259 | -0.619757 | 1.100359 |
1 | 0.469276 | -0.696190 | -0.816206 |
2 | 0.617573 | -0.696190 | 0.235891 |
3 | 3.014322 | -0.696190 | -0.004664 |
4 | 1.199192 | NaN | -0.004664 |
5 | -0.835414 | NaN | -0.004664 |
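In the table above, column 1 repeats its last valid value for two rows and then stays NaN, which is what a forward fill with a limit of 2 produces. Here is a sketch of that behaviour, assuming random data with an arbitrary seed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed; values differ from the table
df = pd.DataFrame(rng.standard_normal((6, 3)))
df.iloc[2:, 1] = np.nan  # four consecutive holes in column 1
df.iloc[4:, 2] = np.nan  # two consecutive holes in column 2

# Forward fill: propagate the last valid value downward,
# but fill at most two consecutive missing entries per gap.
filled = df.ffill(limit=2)

print(filled)
```

Column 2 is filled completely because its gap is only two rows long, while column 1 still has NaN in rows 4 and 5.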
Data Transformation
Removing Duplicates
| | k1 | k2 |
---|---|---|
0 | one | 1 |
1 | two | 1 |
2 | one | 2 |
3 | two | 3 |
4 | one | 3 |
5 | two | 4 |
6 | two | 4 |
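In the table above, row 6 repeats row 5 exactly. The sketch below rebuilds that frame and shows how `duplicated()` flags such rows and `drop_duplicates()` removes them; the variable names are illustrative.

```python
import pandas as pd

# The DataFrame from the table above.
data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})

# True for each row that is identical to an earlier row.
is_duplicate = data.duplicated()

# Keep the first occurrence of each distinct row.
deduplicated = data.drop_duplicates()

# Consider only column k1 when deciding what counts as a duplicate.
first_per_k1 = data.drop_duplicates(subset=["k1"])

print(is_duplicate)
print(deduplicated)
```

Passing `keep="last"` keeps the last occurrence instead of the first.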
Transforming Data Using a Function or Mapping
| | food | ounces |
---|---|---|
0 | bacon | 4.0 |
1 | pulled pork | 3.0 |
2 | bacon | 12.0 |
3 | Pastrami | 6.0 |
4 | corned beef | 7.5 |
5 | Bacon | 8.0 |
6 | pastrami | 3.0 |
7 | honey ham | 5.0 |
8 | nova lox | 6.0 |
| | food | ounces | animal |
---|---|---|---|
0 | bacon | 4.0 | pig |
1 | pulled pork | 3.0 | pig |
2 | bacon | 12.0 | pig |
3 | Pastrami | 6.0 | cow |
4 | corned beef | 7.5 | cow |
5 | Bacon | 8.0 | pig |
6 | pastrami | 3.0 | cow |
7 | honey ham | 5.0 | pig |
8 | nova lox | 6.0 | salmon |
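The second table adds an `animal` column derived from `food`. Some foods are capitalized differently ("Pastrami", "Bacon"), so the names have to be normalized before the lookup. Here is a sketch of one way to do it; the dictionary name `meat_to_animal` is an assumption.

```python
import pandas as pd

# The DataFrame from the first table above.
data = pd.DataFrame({
    "food": ["bacon", "pulled pork", "bacon", "Pastrami", "corned beef",
             "Bacon", "pastrami", "honey ham", "nova lox"],
    "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6],
})

# Lookup table from (lowercase) food name to animal.
meat_to_animal = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
    "honey ham": "pig",
    "nova lox": "salmon",
}

# Lowercase first so that "Bacon" and "bacon" hit the same key,
# then map each value through the dictionary.
data["animal"] = data["food"].str.lower().map(meat_to_animal)

print(data)
```

`map` also accepts a function, for example `data["food"].map(lambda x: meat_to_animal[x.lower()])`, which gives the same result here.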
Replacing Values
| | 0 |
---|---|
0 | 1.0 |
1 | -999.0 |
2 | 2.0 |
3 | -999.0 |
4 | -1000.0 |
5 | 3.0 |
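Values such as -999 and -1000 in the Series above are typical sentinel codes for missing data. Treating them that way is an assumption here, since the notes do not say so. Under that assumption, `replace` can turn them into proper NaN values:

```python
import numpy as np
import pandas as pd

# The Series from the table above.
data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# Replace a single sentinel value.
one_sentinel = data.replace(-999, np.nan)

# Replace several values at once with the same substitute.
both_sentinels = data.replace([-999, -1000], np.nan)

# Use a dict to give each value its own substitute.
per_value = data.replace({-999: np.nan, -1000: 0})

print(both_sentinels)
```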
Renaming Axis Indexes
| | one | two | three | four |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
New York | 8 | 9 | 10 | 11 |
| | one | two | three | four |
---|---|---|---|---|
OHIO | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
| | ONE | TWO | THREE | FOUR |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colo | 4 | 5 | 6 | 7 |
New | 8 | 9 | 10 | 11 |
| | one | two | peekaboo | four |
---|---|---|---|---|
INDIANA | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
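The four tables above can be reproduced by applying a function to the index labels and then calling `rename`. This is a sketch; the helper name `shorten` is hypothetical.

```python
import numpy as np
import pandas as pd

# The DataFrame from the first table above.
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["Ohio", "Colorado", "New York"],
                    columns=["one", "two", "three", "four"])

def shorten(label):
    """Return the first four characters of a label in upper case."""
    return label[:4].upper()

# Assigning to .index changes the labels in place.
data.index = data.index.map(shorten)

# rename() returns a new DataFrame and leaves `data` unchanged.
titled = data.rename(index=str.title, columns=str.upper)

# A dict renames only the listed labels.
relabeled = data.rename(index={"OHIO": "INDIANA"},
                        columns={"three": "peekaboo"})

print(data)
print(titled)
print(relabeled)
```

One detail that is easy to miss: the first four characters of "New York" include the space, so the label becomes `"NEW "` with a trailing space, which the rendered table hides.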
Detecting and Filtering Outliers
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 0.055077 | 0.045030 | -0.025765 | -0.000986 |
std | 0.993835 | 0.978408 | 0.978154 | 1.031257 |
min | -3.832766 | -3.076625 | -3.213987 | -3.419100 |
25% | -0.583017 | -0.613001 | -0.719985 | -0.683580 |
50% | 0.055401 | 0.039207 | -0.010730 | -0.012271 |
75% | 0.707915 | 0.764889 | 0.677561 | 0.673964 |
max | 3.739664 | 3.000617 | 2.741589 | 3.223334 |
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
111 | -1.237494 | 0.385405 | 1.708786 | 3.223334 |
186 | 0.129223 | -0.257331 | -1.856794 | -3.419100 |
308 | -0.324471 | 3.000617 | -1.237963 | -0.085539 |
356 | 3.192386 | 0.567717 | 0.174435 | 0.320250 |
420 | 0.679497 | -2.224678 | -3.213987 | -0.504708 |
425 | -3.832766 | -1.893903 | 0.340340 | -0.327178 |
473 | 0.363863 | -3.076625 | -0.344005 | -0.413175 |
485 | 0.124100 | 1.183545 | -3.023981 | 0.579325 |
541 | 3.190462 | -0.028757 | -0.047572 | 2.121152 |
753 | 3.739664 | 0.733766 | -0.229820 | 0.368722 |
823 | 1.983805 | -0.933073 | -1.768958 | -3.194717 |
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 0.054787 | 0.045106 | -0.025527 | -0.000596 |
std | 0.987260 | 0.978164 | 0.977406 | 1.028689 |
min | -3.000000 | -3.000000 | -3.000000 | -3.000000 |
25% | -0.583017 | -0.613001 | -0.719985 | -0.683580 |
50% | 0.055401 | 0.039207 | -0.010730 | -0.012271 |
75% | 0.707915 | 0.764889 | 0.677561 | 0.673964 |
max | 3.000000 | 3.000000 | 2.741589 | 3.000000 |
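The three tables above show summary statistics for 1000 rows of roughly standard-normal data, the rows containing a value beyond ±3, and the statistics after capping at ±3. Here is a sketch of those steps, assuming normally distributed data with an arbitrary seed, so the numbers will differ. It uses `mask`, which returns a new frame; the original cell may have assigned in place instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)  # arbitrary seed
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Summary statistics per column (count, mean, std, quartiles, ...).
summary = data.describe()

# Boolean frame: True where a value is more than 3 away from zero.
is_outlier = data.abs() > 3

# Rows that contain at least one outlier in any column.
outlier_rows = data[is_outlier.any(axis=1)]

# Cap outliers at -3 or +3: np.sign gives -1 or 1, so sign * 3 is
# the nearest limit. Values that are not outliers are left as they are.
capped = data.mask(is_outlier, np.sign(data) * 3)

print(outlier_rows)
print(capped.describe())
```

`data.clip(lower=-3, upper=3)` gives the same capped frame in a single call.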
Permutation and Random Sampling
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
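Using the 5×4 frame above, rows can be shuffled with a random permutation of the row positions, or sampled directly with `sample`. This is a sketch; the seeds and variable names are assumptions.

```python
import numpy as np
import pandas as pd

# The DataFrame from the table above.
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# A random ordering of the integers 0..4.
rng = np.random.default_rng(0)  # arbitrary seed
new_order = rng.permutation(len(df))

# take() selects rows by position, in the given order.
shuffled = df.take(new_order)

# Random subset of three rows, without replacement.
subset = df.sample(n=3, random_state=0)

# Sampling with replacement allows more draws than there are values.
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True, random_state=0)

print(shuffled)
print(subset)
```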
Computing Indicator/Dummy Variables
| | a | b | c |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 1 | 0 | 0 |
5 | 0 | 1 | 0 |
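The table above is an indicator (one-hot) encoding of a categorical column with values a, b, and c. Here is a sketch that reproduces it with `pd.get_dummies`, assuming a `key` column whose values are inferred from the table. `dtype=int` is passed because recent pandas versions return True/False by default.

```python
import pandas as pd

# Hypothetical source frame; the key values are inferred from the table above.
df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})

# One column per distinct value, 1 where the row has that value.
dummies = pd.get_dummies(df["key"], dtype=int)

# A prefix makes the new column names self-describing when they
# are joined back onto the other columns.
with_prefix = df[["data1"]].join(
    pd.get_dummies(df["key"], prefix="key", dtype=int)
)

print(dummies)
print(with_prefix)
```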