day 4: getting started with data visualization¶

overview day 4
section 4.2.1 python syntax
- variables and data types
section 4.2.2 functions
- functions and modules
section 4.2.3 files
- files and directories

section 4.2.1¶

section 4.2.2¶

some line

section 4.2.3¶

some line

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#!pip install seaborn
import seaborn as sns

#%matplotlib inline
#%reload_ext autoreload
#%autoreload 2

#dfc = pd.read_csv('some.csv')
print(sns.get_dataset_names())
dfc = sns.load_dataset('car_crashes')
dfc.head()

['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'exercise', 'flights', 'fmri', 'gammas', 'geyser', 'iris', 'mpg', 'penguins', 'planets', 'tips', 'titanic']

sns.displot(dfc['not_distracted'])

<seaborn.axisgrid.FacetGrid at 0x7fb861c292e0>

Quickly Creating Summary Counts in Pandas¶

Let's next count the number of samples for each species. We can do this in a few ways, but we'll use groupby combined with a count() method.

surveys_df = pd.read_csv("../../csv/surveys.csv")

# Count the number of samples by species
species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(species_counts)

species_id
AB      303
AH      437
AS        2
BA       46
CB       50
CM       13
CQ       16
CS        1
CT        1
CU        1
CV        1
DM    10596
DO     3027
DS     2504
DX       40
NL     1252
OL     1006
OT     2249
OX       12
PB     2891
PC       39
PE     1299
PF     1597
PG        8
PH       32
PI        9
PL       36
PM      899
PP     3123
PU        5
PX        6
RF       75
RM     2609
RO        8
RX        2
SA       75
SC        1
SF       43
SH      147
SO       43
SS      248
ST        1
SU        5
UL        4
UP        8
UR       10
US        4
ZL        2
Name: record_id, dtype: int64

Or, we can also count just the rows that have the species "DO":

surveys_df.groupby('species_id')['record_id'].count()['DO']

3027

Basic Math Functions¶

If we wanted to, we could perform math on an entire column of our data. For example let's multiply all weight values by 2. A more practical use of this might be to normalize the data according to a mean, area, or some other value calculated from our data.

# Multiply all weight values by 2 but does not change the original weight data
surveys_df['weight']*2

0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
         ...  
35544      NaN
35545      NaN
35546     28.0
35547    102.0
35548      NaN
Name: weight, Length: 35549, dtype: float64

Quick & Easy Plotting Data Using Pandas¶

We can plot our summary stats using Pandas, too.

## To make sure figures appear inside Jupyter Notebook
%matplotlib inline

# Create a quick bar chart
species_counts.plot(kind='bar')

<AxesSubplot:xlabel='species_id'>

Animals per site plot¶

We can also look at how many animals were captured in each site.

total_count = surveys_df.groupby('site_id')['record_id'].nunique()
# Let's plot that too
total_count.plot(kind='bar')

<AxesSubplot:xlabel='site_id'>

Extra Plotting Challenge¶

Create a plot of average weight across all species per plot.
Create a plot of total males versus total females for the entire dataset.
Create a stacked bar plot, with weight on the Y axis, and the stacked variable being sex. The plot should show total weight by sex for each plot. Some tips are below to help you solve this challenge: For more on Pandas plots, visit this link.

Solution to Extra Plotting Challenge 1¶

## Solution Plotting Challenge 1
surveys_df.groupby('site_id').mean()["weight"].plot(kind='bar')

<AxesSubplot:xlabel='site_id'>

Solution to Extra Plotting Challenge 2¶

# Solution Plotting Challenge 2
## Create plot of total males versus total females for the entire dataset.

surveys_df.groupby('sex').count()["record_id"].plot(kind='bar')

<AxesSubplot:xlabel='sex'>

Solution to Extra Plotting Challenge 3¶

First we group data by site and by sex, and then calculate a total for each site.

by_site_sex = surveys_df.groupby(['site_id','sex'])
site_sex_count = by_site_sex['weight'].sum()

This calculates the sums of weights for each sex within each plot as a table

site  sex
site_id  sex
1        F      38253
         M      59979
2        F      50144
         M      57250
3        F      27251
         M      28253
4        F      39796
         M      49377
<other sites removed for brevity>

Below we'll use .unstack() on our grouped data to figure out the total weight that each sex contributed to each plot.

by_site_sex = surveys_df.groupby(['site_id','sex'])
site_sex_count = by_site_sex['weight'].sum()
site_sex_count.unstack()

Now, create a stacked bar plot with that data where the weights for each sex are stacked by plot.

Rather than display it as a table, we can plot the above data by stacking the values of each sex as follows:

by_site_sex = surveys_df.groupby(['site_id', 'sex'])
site_sex_count = by_site_sex['weight'].sum()
spc = site_sex_count.unstack()
s_plot = spc.plot(kind='bar', stacked=True, title="Total weight by site and sex")
s_plot.set_ylabel("Weight")
s_plot.set_xlabel("Site")

Text(0.5, 0, 'Site')

	total	speeding	alcohol	not_distracted	no_previous	ins_premium	ins_losses	abbrev
0	18.8	7.332	5.640	18.048	15.040	784.55	145.08	AL
1	18.1	7.421	4.525	16.290	17.014	1053.48	133.93	AK
2	18.6	6.510	5.208	15.624	17.856	899.47	110.35	AZ
3	22.4	4.032	5.824	21.056	21.280	827.34	142.39	AR
4	12.0	4.200	3.360	10.920	10.680	878.41	165.63	CA

sex	F	M
site_id
1	38253.0	59979.0
2	50144.0	57250.0
3	27251.0	28253.0
4	39796.0	49377.0
5	21143.0	23326.0
6	26210.0	27245.0
7	6522.0	6422.0
8	37274.0	47755.0
9	44128.0	48727.0
10	2359.0	2776.0
11	34638.0	43106.0
12	51825.0	57420.0
13	24720.0	30354.0
14	32770.0	46469.0
15	12455.0	11037.0
16	5446.0	6310.0
17	42106.0	48082.0
18	27353.0	26433.0
19	11297.0	11514.0
20	33206.0	25988.0
21	15481.0	9815.0
22	34656.0	35363.0
23	3352.0	3883.0
24	22951.0	18835.0