Data Wrangling with Pandas III#
We will begin soon! Until then, feel free to use the chat to socialize, and enjoy the music!
Outline#
In this Class we will talk about:
Mid-Course Feedback
Aggregation and Grouping in Pandas
Method Chaining in Python
Additional Notes
Mid-Course Feedback#
Here is the mid-course feedback survey that I am hoping you can fill out during class.
Aggregation and Grouping#
Attribution#
This notebook contains an excerpt from the Python Data Science Handbook by Jake VanderPlas; the content is available on GitHub.
The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
An essential piece of analysis of large data is efficient summarization: computing aggregations like sum()
, mean()
, median()
, min()
, and max()
, in which a single number gives insight into the nature of a potentially large dataset.
In this section, we’ll explore aggregations in Pandas, from simple operations akin to what we’ve seen on NumPy arrays, to more sophisticated operations based on the concept of a groupby
.
For convenience, we’ll use the same display
magic function that we’ve seen in previous sections:
import numpy as np
import pandas as pd
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return "\n".join(
self.template.format(a, eval(a)._repr_html_()) for a in self.args
)
def __repr__(self):
return "\n\n".join(a + "\n" + repr(eval(a)) for a in self.args)
Simple Aggregation in Pandas#
import pandas as pd
df = pd.DataFrame({"A": np.random.rand(5), "B": np.random.rand(5)})
df
A | B | |
---|---|---|
0 | 0.006191 | 0.364830 |
1 | 0.665987 | 0.715072 |
2 | 0.015547 | 0.367177 |
3 | 0.499678 | 0.196315 |
4 | 0.076605 | 0.776083 |
np.mean([0.734562, 0.723762, 0.075712, 0.683636, 0.878895])
0.6193134
np.mean(df["A"].tolist())
# df['A'].mean() # pure pandas way of doing it
0.25280146645003115
df.mean(axis="rows")
A 0.252801
B 0.483895
dtype: float64
By specifying the axis
argument, you can instead aggregate within each row:
df.mean(axis="columns")
0 0.185511
1 0.690529
2 0.191362
3 0.347996
4 0.426344
dtype: float64
Planets Data#
Here we will use the Planets dataset, available via the Seaborn package (see Visualization With Seaborn). It gives information on planets that astronomers have discovered around other stars (known as extrasolar planets or exoplanets for short). It can be downloaded with a simple Seaborn command:
import seaborn as sns # `conda install -c conda-forge seaborn` in your terminal
planets = sns.load_dataset("planets")
planets
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.300000 | 7.10 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.774000 | 2.21 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.000000 | 2.60 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.030000 | 19.40 | 110.62 | 2007 |
4 | Radial Velocity | 1 | 516.220000 | 10.50 | 119.47 | 2009 |
... | ... | ... | ... | ... | ... | ... |
1030 | Transit | 1 | 3.941507 | NaN | 172.00 | 2006 |
1031 | Transit | 1 | 2.615864 | NaN | 148.00 | 2007 |
1032 | Transit | 1 | 3.191524 | NaN | 174.00 | 2007 |
1033 | Transit | 1 | 4.125083 | NaN | 293.00 | 2008 |
1034 | Transit | 1 | 4.187757 | NaN | 260.00 | 2008 |
1035 rows × 6 columns
planets.head(4)
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
This has some details on the 1,000+ extrasolar planets discovered up to 2014.
Pandas Series
and DataFrame
s include all of the common aggregates mentioned in Aggregations: Min, Max, and Everything In Between; in addition, there is a convenience method describe()
that computes several common aggregates for each column and returns the result.
Let’s use this on the Planets data, for now dropping rows with missing values:
planets.dropna()
#planets.dropna(subset=["method"])
#planets.dropna(thresh=3)
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.30000 | 7.100 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.77400 | 2.210 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.00000 | 2.600 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.03000 | 19.400 | 110.62 | 2007 |
4 | Radial Velocity | 1 | 516.22000 | 10.500 | 119.47 | 2009 |
... | ... | ... | ... | ... | ... | ... |
640 | Radial Velocity | 1 | 111.70000 | 2.100 | 14.90 | 2009 |
641 | Radial Velocity | 1 | 5.05050 | 1.068 | 44.46 | 2013 |
642 | Radial Velocity | 1 | 311.28800 | 1.940 | 17.24 | 1999 |
649 | Transit | 1 | 2.70339 | 1.470 | 178.00 | 2013 |
784 | Radial Velocity | 3 | 580.00000 | 0.947 | 135.00 | 2012 |
498 rows × 6 columns
planets.describe()
number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|
count | 1035.000000 | 992.000000 | 513.000000 | 808.000000 | 1035.000000 |
mean | 1.785507 | 2002.917596 | 2.638161 | 264.069282 | 2009.070531 |
std | 1.240976 | 26014.728304 | 3.818617 | 733.116493 | 3.972567 |
min | 1.000000 | 0.090706 | 0.003600 | 1.350000 | 1989.000000 |
25% | 1.000000 | 5.442540 | 0.229000 | 32.560000 | 2007.000000 |
50% | 1.000000 | 39.979500 | 1.260000 | 55.250000 | 2010.000000 |
75% | 2.000000 | 526.005000 | 3.040000 | 178.500000 | 2012.000000 |
max | 7.000000 | 730000.000000 | 25.000000 | 8500.000000 | 2014.000000 |
This can be a useful way to begin understanding the overall properties of a dataset.
For example, we see in the year
column that although exoplanets were discovered as far back as 1989, half of all known expolanets were not discovered until 2009 or after.
This is largely thanks to the Kepler mission, which is a space-based telescope specifically designed for finding eclipsing planets around other stars.
The following table summarizes some other built-in Pandas aggregations:
Aggregation |
Description |
---|---|
|
Total number of items |
|
First and last item |
|
Mean and median |
|
Minimum and maximum |
|
Standard deviation and variance |
|
Mean absolute deviation |
|
Product of all items |
|
Sum of all items |
These are all methods of DataFrame
and Series
objects.
To go deeper into the data, however, simple aggregates are often not enough.
The next level of data summarization is the groupby
operation, which allows you to quickly and efficiently compute aggregates on subsets of data.
GroupBy: Split, Apply, Combine#
Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called groupby
operation.
The name “group by” comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: split, apply, combine.
Split, apply, combine#
A canonical example of this split-apply-combine operation, where the “apply” is a summation aggregation, is illustrated in this figure:
figure source in Appendix
This makes clear what the groupby
accomplishes:
The split step involves breaking up and grouping a
DataFrame
depending on the value of the specified key.The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
The combine step merges the results of these operations into an output array.
While this could certainly be done manually using some combination of the masking, aggregation, and merging commands covered earlier, an important realization is that the intermediate splits do not need to be explicitly instantiated. Rather, the GroupBy
can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.
The power of the GroupBy
is that it abstracts away these steps: the user need not think about how the computation is done under the hood, but rather thinks about the operation as a whole.
As a concrete example, let’s take a look at using Pandas for the computation shown in this diagram.
We’ll start by creating the input DataFrame
:
df = pd.DataFrame(
{"key": ["A", "B", "C", "A", "B", "C"], "data": range(6)}, columns=["key", "data"]
)
df.head(6)
key | data | |
---|---|---|
0 | A | 0 |
1 | B | 1 |
2 | C | 2 |
3 | A | 3 |
4 | B | 4 |
5 | C | 5 |
The most basic split-apply-combine operation can be computed with the groupby()
method of DataFrame
s, passing the name of the desired key column:
df.groupby("key")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2816b47730>
Notice that what is returned is not a set of DataFrame
s, but a DataFrameGroupBy
object.
This object is where the magic is: you can think of it as a special view of the DataFrame
, which is poised to dig into the groups but does no actual computation until the aggregation is applied.
This “lazy evaluation” approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.
To produce a result, we can apply an aggregate to this DataFrameGroupBy
object, which will perform the appropriate apply/combine steps to produce the desired result:
df.groupby("key").sum().reset_index()
key | data | |
---|---|---|
0 | A | 3 |
1 | B | 5 |
2 | C | 7 |
The sum()
method is just one possibility here; you can apply virtually any common Pandas or NumPy aggregation function, as well as virtually any valid DataFrame
operation, as we will see in the following discussion.
The GroupBy object#
The GroupBy
object is a very flexible abstraction.
In many ways, you can simply treat it as if it’s a collection of DataFrame
s, and it does the difficult things under the hood. Let’s see some examples using the Planets data.
Perhaps the most important operations made available by a GroupBy
are aggregate, filter, transform, and apply.
We’ll discuss each of these more fully in “Aggregate, Filter, Transform, Apply”, but before that let’s introduce some of the other functionality that can be used with the basic GroupBy
operation.
Column indexing#
The GroupBy
object supports column indexing in the same way as the DataFrame
, and returns a modified GroupBy
object.
For example:
planets
# Exercise: groupby method, and find mean of mass column
# Solution: planets.dropna(subset=["mass"]).groupby("method").mean()[['mass']].reset_index()
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.300000 | 7.10 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.774000 | 2.21 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.000000 | 2.60 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.030000 | 19.40 | 110.62 | 2007 |
4 | Radial Velocity | 1 | 516.220000 | 10.50 | 119.47 | 2009 |
... | ... | ... | ... | ... | ... | ... |
1030 | Transit | 1 | 3.941507 | NaN | 172.00 | 2006 |
1031 | Transit | 1 | 2.615864 | NaN | 148.00 | 2007 |
1032 | Transit | 1 | 3.191524 | NaN | 174.00 | 2007 |
1033 | Transit | 1 | 4.125083 | NaN | 293.00 | 2008 |
1034 | Transit | 1 | 4.187757 | NaN | 260.00 | 2008 |
1035 rows × 6 columns
planets.groupby("method")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2816b477f0>
planets.groupby("method")["orbital_period"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f281695d090>
Here we’ve selected a particular Series
group from the original DataFrame
group by reference to its column name.
As with the GroupBy
object, no computation is done until we call some aggregate on the object:
planets.groupby("method")["orbital_period"].median()
method
Astrometry 631.180000
Eclipse Timing Variations 4343.500000
Imaging 27500.000000
Microlensing 3300.000000
Orbital Brightness Modulation 0.342887
Pulsar Timing 66.541900
Pulsation Timing Variations 1170.000000
Radial Velocity 360.200000
Transit 5.714932
Transit Timing Variations 57.011000
Name: orbital_period, dtype: float64
This gives an idea of the general scale of orbital periods (in days) that each method is sensitive to.
Iteration over groups#
The GroupBy
object supports direct iteration over the groups, returning each group as a Series
or DataFrame
:
for (method, group) in planets.groupby("method"):
print(f"{method:30s} shape={group.shape}")
Astrometry shape=(2, 6)
Eclipse Timing Variations shape=(9, 6)
Imaging shape=(38, 6)
Microlensing shape=(23, 6)
Orbital Brightness Modulation shape=(3, 6)
Pulsar Timing shape=(5, 6)
Pulsation Timing Variations shape=(1, 6)
Radial Velocity shape=(553, 6)
Transit shape=(397, 6)
Transit Timing Variations shape=(4, 6)
This can be useful for doing certain things manually, though it is often much faster to use the built-in apply
functionality, which we will discuss momentarily.
Dispatch methods#
Through some Python class magic, any method not explicitly implemented by the GroupBy
object will be passed through and called on the groups, whether they are DataFrame
or Series
objects.
For example, you can use the describe()
method of DataFrame
s to perform a set of aggregations that describe each group in the data:
planets.groupby("method")["year"].describe().unstack()
method
count Astrometry 2.0
Eclipse Timing Variations 9.0
Imaging 38.0
Microlensing 23.0
Orbital Brightness Modulation 3.0
...
max Pulsar Timing 2011.0
Pulsation Timing Variations 2007.0
Radial Velocity 2014.0
Transit 2014.0
Transit Timing Variations 2014.0
Length: 80, dtype: float64
Looking at this table helps us to better understand the data: for example, the vast majority of planets have been discovered by the Radial Velocity and Transit methods, though the latter only became common (due to new, more accurate telescopes) in the last decade. The newest methods seem to be Transit Timing Variation and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.
This is just one example of the utility of dispatch methods.
Notice that they are applied to each individual group, and the results are then combined within GroupBy
and returned.
Again, any valid DataFrame
/Series
method can be used on the corresponding GroupBy
object, which allows for some very flexible and powerful operations!
Aggregate, filter, transform, apply#
The preceding discussion focused on aggregation for the combine operation, but there are more options available.
In particular, GroupBy
objects have aggregate()
, filter()
, transform()
, and apply()
methods that efficiently implement a variety of useful operations before combining the grouped data.
For the purpose of the following subsections, we’ll use this DataFrame
:
rng = np.random.RandomState(0)
df = pd.DataFrame(
{
"key": ["A", "B", "C", "A", "B", "C"],
"data1": range(6),
"data2": rng.randint(0, 10, 6),
},
columns=["key", "data1", "data2"],
)
df
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 5 |
1 | B | 1 | 0 |
2 | C | 2 | 3 |
3 | A | 3 | 3 |
4 | B | 4 | 7 |
5 | C | 5 | 9 |
Aggregation#
We’re now familiar with GroupBy
aggregations with sum()
, median()
, and the like, but the aggregate()
method allows for even more flexibility.
It can take a string, a function, or a list thereof, and compute all the aggregates at once.
Here is a quick example combining all these:
df.groupby("key").aggregate(["min", np.median, max])
/tmp/ipykernel_2463/3430934023.py:1: FutureWarning: The provided callable <function median at 0x7f285c0d79a0> is currently using SeriesGroupBy.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.
df.groupby("key").aggregate(["min", np.median, max])
/tmp/ipykernel_2463/3430934023.py:1: FutureWarning: The provided callable <built-in function max> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
df.groupby("key").aggregate(["min", np.median, max])
data1 | data2 | |||||
---|---|---|---|---|---|---|
min | median | max | min | median | max | |
key | ||||||
A | 0 | 1.5 | 3 | 3 | 4.0 | 5 |
B | 1 | 2.5 | 4 | 0 | 3.5 | 7 |
C | 2 | 3.5 | 5 | 3 | 6.0 | 9 |
Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:
df.groupby("key").aggregate({"data1": "min",
"data2": "mean"})
data1 | data2 | |
---|---|---|
key | ||
A | 0 | 4.0 |
B | 1 | 3.5 |
C | 2 | 6.0 |
Filtering#
A filtering operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value:
def filter_func(x):
return x["data2"].std() > 4
display("df", "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")
df
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 5 |
1 | B | 1 | 0 |
2 | C | 2 | 3 |
3 | A | 3 | 3 |
4 | B | 4 | 7 |
5 | C | 5 | 9 |
df.groupby('key').std()
data1 | data2 | |
---|---|---|
key | ||
A | 2.12132 | 1.414214 |
B | 2.12132 | 4.949747 |
C | 2.12132 | 4.242641 |
df.groupby('key').filter(filter_func)
key | data1 | data2 | |
---|---|---|---|
1 | B | 1 | 0 |
2 | C | 2 | 3 |
4 | B | 4 | 7 |
5 | C | 5 | 9 |
The filter function should return a Boolean value specifying whether the group passes the filtering. Here because group A does not have a standard deviation greater than 4, it is dropped from the result.
Transformation#
While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean:
df.groupby("key").transform(lambda x: x - x.mean())
data1 | data2 | |
---|---|---|
0 | -1.5 | 1.0 |
1 | -1.5 | -3.5 |
2 | -1.5 | -3.0 |
3 | 1.5 | -1.0 |
4 | 1.5 | 3.5 |
5 | 1.5 | 3.0 |
The apply() method#
The apply()
method lets you apply an arbitrary function to the group results.
The function should take a DataFrame
, and return either a Pandas object (e.g., DataFrame
, Series
) or a scalar; the combine operation will be tailored to the type of output returned.
For example, here is an apply()
that normalizes the first column by the sum of the second:
def norm_by_data2(x):
# x is a DataFrame of group values
x["data1"] /= x["data2"].sum() # "y/= something" is equivalent to "y = y/something"
return x
display("df", "df.groupby('key').apply(norm_by_data2)")
df
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 5 |
1 | B | 1 | 0 |
2 | C | 2 | 3 |
3 | A | 3 | 3 |
4 | B | 4 | 7 |
5 | C | 5 | 9 |
df.groupby('key').apply(norm_by_data2)
key | data1 | data2 | ||
---|---|---|---|---|
key | ||||
A | 0 | A | 0.000000 | 5 |
3 | A | 0.375000 | 3 | |
B | 1 | B | 0.142857 | 0 |
4 | B | 0.571429 | 7 | |
C | 2 | C | 0.166667 | 3 |
5 | C | 0.416667 | 9 |
apply()
within a GroupBy
is quite flexible: the only criterion is that the function takes a DataFrame
and returns a Pandas object or scalar; what you do in the middle is up to you!
Method Chaining#
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
data = load_wine() # this a data file that gets loaded
df = pd.DataFrame(data["data"],
columns=data["feature_names"])
df.head()
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[27], line 3
1 import pandas as pd
2 import numpy as np
----> 3 from sklearn.datasets import load_wine
5 data = load_wine() # this a data file that gets loaded
7 df = pd.DataFrame(data["data"],
8 columns=data["feature_names"])
ModuleNotFoundError: No module named 'sklearn'
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
data = load_wine() # this a data file that gets loaded
# Method chaining begins
df = (
pd.DataFrame(data.data,columns=data.feature_names)
.rename(columns={"color_intensity": "ci"})
.assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
.loc[lambda x: x['alcohol']>14]
.sort_values("alcohol", ascending=False)
.reset_index(drop=True)
.loc[:, ["alcohol", "ci", "hue"]]
)
df
data.keys()
df = pd.DataFrame(data.data,columns=data.feature_names)
df = df.rename(columns={"color_intensity": "ci"})
df
alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | ci | hue | od280/od315_of_diluted_wines | proline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
173 | 13.71 | 5.65 | 2.45 | 20.5 | 95.0 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740.0 |
174 | 13.40 | 3.91 | 2.48 | 23.0 | 102.0 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750.0 |
175 | 13.27 | 4.28 | 2.26 | 20.0 | 120.0 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835.0 |
176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840.0 |
177 | 14.13 | 4.10 | 2.74 | 24.5 | 96.0 | 2.05 | 0.76 | 0.56 | 1.35 | 9.20 | 0.61 | 1.60 | 560.0 |
178 rows × 13 columns
df["color_filter"] = np.where((df["hue"] > 1) &
(df["ci"] > 7), 1, 0)
df.head()
# Inspired from Source:
# https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | ci | hue | od280/od315_of_diluted_wines | proline | color_filter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
df = (
pd.DataFrame(data.data, columns=data.feature_names)
.rename(columns={"color_intensity": "ci"})
.assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
.loc[lambda x: x["alcohol"] > 14]
.sort_values("alcohol", ascending=False)
.reset_index(drop=True)
.loc[:, ["alcohol", "ci", "hue"]]
)
df
alcohol | ci | hue | |
---|---|---|---|
0 | 14.83 | 5.20 | 1.08 |
1 | 14.75 | 5.40 | 1.25 |
2 | 14.39 | 5.25 | 1.02 |
3 | 14.38 | 4.90 | 1.04 |
4 | 14.38 | 7.50 | 1.20 |
5 | 14.37 | 7.80 | 0.86 |
6 | 14.34 | 13.00 | 0.57 |
7 | 14.30 | 6.20 | 1.07 |
8 | 14.23 | 5.64 | 1.04 |
9 | 14.22 | 5.10 | 0.89 |
10 | 14.22 | 6.38 | 0.94 |
11 | 14.21 | 5.24 | 0.87 |
12 | 14.20 | 6.75 | 1.05 |
13 | 14.19 | 8.70 | 1.23 |
14 | 14.16 | 9.70 | 0.62 |
15 | 14.13 | 9.20 | 0.61 |
16 | 14.12 | 5.00 | 1.17 |
17 | 14.10 | 6.20 | 1.07 |
18 | 14.10 | 5.75 | 1.25 |
19 | 14.06 | 5.65 | 1.09 |
20 | 14.06 | 5.05 | 1.06 |
21 | 14.02 | 4.70 | 1.04 |
There is LOTS more to explore!#
Below are some additional notes that show the (additional) power of combining many of the operations we’ve discussed up to this point when looking at realistic datasets. We can gain a thorough understanding of when and how planets have been discovered over the past several decades!
I would suggest digging into these few lines of code, and evaluating the individual steps to make sure you understand exactly what they are doing to the result. It’s certainly a somewhat complicated example, but understanding these pieces will give you the means to similarly explore your own data.
That’s it!
Have a good weekend!
Additional Notes#
Specifying the split key#
In the simple examples presented before, we split the DataFrame
on a single column name.
This is just one of many options by which the groups can be defined, and we’ll go through some other options for group specification here.
A list, array, series, or index providing the grouping keys#
The key can be any series or list with a length matching that of the DataFrame
. For example:
L = [0, 1, 0, 1, 2, 0]
display("df", "df.groupby(L).sum()")
Of course, this means there’s another, more verbose way of accomplishing the df.groupby('key')
from before:
display("df", "df.groupby(df['key']).sum()")
A dictionary or series mapping index to group#
Another method is to provide a dictionary that maps index values to the group keys:
df2 = df.set_index("key")
mapping = {"A": "vowel", "B": "consonant", "C": "consonant"}
display("df2", "df2.groupby(mapping).sum()")
Any Python function#
Similar to mapping, you can pass any Python function that will input the index value and output the group:
display("df2", "df2.groupby(str.lower).mean()")
A list of valid keys#
Further, any of the preceding key choices can be combined to group on a multi-index:
df2.groupby([str.lower, mapping]).mean()
Grouping example#
As an example of this, in a couple lines of Python code we can put all these together and count discovered planets by method and by decade:
planets
decade = 10 * (planets["year"] // 10)
decade
decade = decade.astype(str) + "s"
decade
decade.name = "decade"
planets.groupby(["method", decade])["number"].sum().unstack().fillna(0)