In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#!pip install seaborn
import seaborn as sns

#%matplotlib inline
#%reload_ext autoreload
#%autoreload 2
In [5]:
#dfc = pd.read_csv('some.csv')
print(sns.get_dataset_names())
dfc = sns.load_dataset('car_crashes')
dfc.head()
['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'exercise', 'flights', 'fmri', 'gammas', 'geyser', 'iris', 'mpg', 'penguins', 'planets', 'tips', 'titanic']
Out[5]:
total speeding alcohol not_distracted no_previous ins_premium ins_losses abbrev
0 18.8 7.332 5.640 18.048 15.040 784.55 145.08 AL
1 18.1 7.421 4.525 16.290 17.014 1053.48 133.93 AK
2 18.6 6.510 5.208 15.624 17.856 899.47 110.35 AZ
3 22.4 4.032 5.824 21.056 21.280 827.34 142.39 AR
4 12.0 4.200 3.360 10.920 10.680 878.41 165.63 CA
In [6]:
sns.displot(dfc['not_distracted'])
Out[6]:
<seaborn.axisgrid.FacetGrid at 0x7fb861c292e0>

Quickly Creating Summary Counts in Pandas

Let's next count the number of samples for each species. We can do this in a few ways, but we'll use groupby combined with a count() method.

In [7]:
surveys_df = pd.read_csv("../../csv/surveys.csv")

# Count the number of samples by species
species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(species_counts)
species_id
AB      303
AH      437
AS        2
BA       46
CB       50
CM       13
CQ       16
CS        1
CT        1
CU        1
CV        1
DM    10596
DO     3027
DS     2504
DX       40
NL     1252
OL     1006
OT     2249
OX       12
PB     2891
PC       39
PE     1299
PF     1597
PG        8
PH       32
PI        9
PL       36
PM      899
PP     3123
PU        5
PX        6
RF       75
RM     2609
RO        8
RX        2
SA       75
SC        1
SF       43
SH      147
SO       43
SS      248
ST        1
SU        5
UL        4
UP        8
UR       10
US        4
ZL        2
Name: record_id, dtype: int64

Or, we can also count just the rows that have the species "DO":

In [8]:
surveys_df.groupby('species_id')['record_id'].count()['DO']
Out[8]:
3027

Basic Math Functions

If we wanted to, we could perform math on an entire column of our data. For example let's multiply all weight values by 2. A more practical use of this might be to normalize the data according to a mean, area, or some other value calculated from our data.

In [9]:
# Multiply all weight values by 2 but does not change the original weight data
surveys_df['weight']*2
Out[9]:
0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
         ...  
35544      NaN
35545      NaN
35546     28.0
35547    102.0
35548      NaN
Name: weight, Length: 35549, dtype: float64

Quick & Easy Plotting Data Using Pandas

We can plot our summary stats using Pandas, too.

In [10]:
## To make sure figures appear inside Jupyter Notebook
%matplotlib inline

# Create a quick bar chart
species_counts.plot(kind='bar')
Out[10]:
<AxesSubplot:xlabel='species_id'>

Animals per site plot

We can also look at how many animals were captured in each site.

In [11]:
total_count = surveys_df.groupby('site_id')['record_id'].nunique()
# Let's plot that too
total_count.plot(kind='bar')
Out[11]:
<AxesSubplot:xlabel='site_id'>

Extra Plotting Challenge

  1. Create a plot of average weight across all species per plot.

  2. Create a plot of total males versus total females for the entire dataset.

  3. Create a stacked bar plot, with weight on the Y axis, and the stacked variable being sex. The plot should show total weight by sex for each plot. Some tips are below to help you solve this challenge: For more on Pandas plots, visit this link.

Solution to Extra Plotting Challenge 1

In [12]:
## Solution Plotting Challenge 1
surveys_df.groupby('site_id').mean()["weight"].plot(kind='bar')
Out[12]:
<AxesSubplot:xlabel='site_id'>

Solution to Extra Plotting Challenge 2

In [13]:
# Solution Plotting Challenge 2
## Create plot of total males versus total females for the entire dataset.

surveys_df.groupby('sex').count()["record_id"].plot(kind='bar')
Out[13]:
<AxesSubplot:xlabel='sex'>

Solution to Extra Plotting Challenge 3

First we group data by site and by sex, and then calculate a total for each site.

In [14]:
by_site_sex = surveys_df.groupby(['site_id','sex'])
site_sex_count = by_site_sex['weight'].sum()

This calculates the sums of weights for each sex within each plot as a table

site  sex
site_id  sex
1        F      38253
         M      59979
2        F      50144
         M      57250
3        F      27251
         M      28253
4        F      39796
         M      49377
<other sites removed for brevity>

Below we'll use .unstack() on our grouped data to figure out the total weight that each sex contributed to each plot.

In [15]:
by_site_sex = surveys_df.groupby(['site_id','sex'])
site_sex_count = by_site_sex['weight'].sum()
site_sex_count.unstack()
Out[15]:
sex F M
site_id
1 38253.0 59979.0
2 50144.0 57250.0
3 27251.0 28253.0
4 39796.0 49377.0
5 21143.0 23326.0
6 26210.0 27245.0
7 6522.0 6422.0
8 37274.0 47755.0
9 44128.0 48727.0
10 2359.0 2776.0
11 34638.0 43106.0
12 51825.0 57420.0
13 24720.0 30354.0
14 32770.0 46469.0
15 12455.0 11037.0
16 5446.0 6310.0
17 42106.0 48082.0
18 27353.0 26433.0
19 11297.0 11514.0
20 33206.0 25988.0
21 15481.0 9815.0
22 34656.0 35363.0
23 3352.0 3883.0
24 22951.0 18835.0

Now, create a stacked bar plot with that data where the weights for each sex are stacked by plot.

Rather than display it as a table, we can plot the above data by stacking the values of each sex as follows:

In [16]:
by_site_sex = surveys_df.groupby(['site_id', 'sex'])
site_sex_count = by_site_sex['weight'].sum()
spc = site_sex_count.unstack()
s_plot = spc.plot(kind='bar', stacked=True, title="Total weight by site and sex")
s_plot.set_ylabel("Weight")
s_plot.set_xlabel("Site")
Out[16]:
Text(0.5, 0, 'Site')
In [ ]:
 
In [ ]: