What The Plot

19 Jul 2024 · 13 min read

Seaborn plots exploration

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

Relational Plots

These plots show relationships between two or more variables.

  • scatterplot: sns.scatterplot()

relplot: A figure-level function for creating scatter and line plots.
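To make "figure-level" a little more concrete, here is a minimal sketch on a made-up DataFrame (the column names and values are invented for illustration). scatterplot draws onto a single Axes, while relplot builds its own figure and can split the data into one subplot per category.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Toy data (hypothetical): two platforms with four films each
toy = pd.DataFrame({
    "year": [2000, 2005, 2010, 2015, 2000, 2005, 2010, 2015],
    "rating": [55, 60, 72, 48, 80, 65, 70, 90],
    "platform": ["A"] * 4 + ["B"] * 4,
})

# Axes-level: draws on one Axes and returns that Axes
ax = sns.scatterplot(data=toy, x="year", y="rating", hue="platform")

# Figure-level: creates a new figure with one subplot per platform
g = sns.relplot(data=toy, x="year", y="rating", col="platform", kind="scatter")
g.set_axis_labels("Year", "Rating out of 100")
```

Because relplot returns a FacetGrid rather than an Axes, titles and labels are usually set through the grid object instead of plt.title().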

movies = pd.read_csv('/home/annie/Python/data/MoviesOnStreamingPlatforms.csv')
movies.head()

   Unnamed: 0  ID  Title                                     Year  Age  Rotten Tomatoes  Netflix  Hulu  Prime Video  Disney+  Type
0           0   1  The Irishman                              2019  18+           98/100        1     0            0        0     0
1           1   2  Dangal                                    2016   7+           97/100        1     0            0        0     0
2           2   3  David Attenborough: A Life on Our Planet  2020   7+           95/100        1     0            0        0     0
3           3   4  Lagaan: Once Upon a Time in India         2001   7+           94/100        1     0            0        0     0
4           4   5  Roma                                      2018  18+           94/100        1     0            0        0     0
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9515 entries, 0 to 9514
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       9515 non-null   int64 
 1   ID               9515 non-null   int64 
 2   Title            9515 non-null   object
 3   Year             9515 non-null   int64 
 4   Age              5338 non-null   object
 5   Rotten Tomatoes  9508 non-null   object
 6   Netflix          9515 non-null   int64 
 7   Hulu             9515 non-null   int64 
 8   Prime Video      9515 non-null   int64 
 9   Disney+          9515 non-null   int64 
 10  Type             9515 non-null   int64 
dtypes: int64(8), object(3)
memory usage: 817.8+ KB
movie_ratings_df = movies.copy().drop(columns=['Unnamed: 0', 'ID'])
movie_ratings_df['ratings'] = movie_ratings_df['Rotten Tomatoes'].str.replace('/100', '').fillna('0').astype(int)
movie_ratings_df.head()
# movie_ratings_df[movie_ratings_df['ratings'] == '0']

   Title                                     Year  Age  Rotten Tomatoes  Netflix  Hulu  Prime Video  Disney+  Type  ratings
0  The Irishman                              2019  18+           98/100        1     0            0        0     0       98
1  Dangal                                    2016   7+           97/100        1     0            0        0     0       97
2  David Attenborough: A Life on Our Planet  2020   7+           95/100        1     0            0        0     0       95
3  Lagaan: Once Upon a Time in India         2001   7+           94/100        1     0            0        0     0       94
4  Roma                                      2018  18+           94/100        1     0            0        0     0       94
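One thing to keep in mind: the fillna('0') step above turns a missing rating into a rating of 0, which then appears as a real point in any plot. A hedged alternative, sketched here on a made-up Series rather than the real file, keeps missing values as missing by using pandas' nullable integer type.

```python
import pandas as pd

# Made-up sample in the same "NN/100" format, with one missing value
raw = pd.Series(["98/100", "7/100", None, "100/100"], name="Rotten Tomatoes")

# Take the part before the slash, then convert; missing entries stay missing
score = raw.str.split("/").str[0]
ratings = pd.to_numeric(score, errors="coerce").astype("Int64")

print(ratings)
print("mean of known ratings:", ratings.mean())
```

Seaborn generally skips missing values when plotting, so the unknown ratings would simply not be drawn instead of piling up at zero.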
# movie_ratings_df.info()
disney_df = movie_ratings_df.copy().drop(columns=['Netflix', 'Hulu', 'Prime Video'])
disney_df = disney_df[disney_df['Disney+'] == 1]
disney_df.head()

      Title                Year  Age  Rotten Tomatoes  Disney+  Type  ratings
 270  White Fang           2018   7+           76/100        1     0       76
 712  Muppets Most Wanted  2014   7+           67/100        1     0       67
1330  Zapped               2014  all           59/100        1     0       59
1813  The Blue Umbrella    2005  NaN           54/100        1     0       54
2029  Sky High             2020  NaN           51/100        1     0       51
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Year', 
                y='ratings', 
                color='hotpink',
                data=disney_df)
plt.title('Rotten Tomatoes rating for Disney movies by year')
plt.ylabel('Ratings out of 100')
plt.yticks(ticks=range(0, 105, 5))  # stop at 105 so the 100 tick is included
plt.xticks(ticks=range(1920, 2024, 10))

#using regplot to add a regression line

# sns.regplot(x='Year', 
#             y='ratings', 
#             data=disney_df,
#             scatter=False, # Disable scatter points to avoid duplicate points
#             color='skyblue')
# plt.ylabel('Ratings out of 100')
# plt.yticks(ticks=range(0,100,5))
plt.show()

[Figure: scatter plot of Rotten Tomatoes ratings for Disney movies by year]

disney_df[disney_df['ratings']<20].sort_values(by='Year', ascending=False)

      Title                                   Year  Age  Rotten Tomatoes  Disney+  Type  ratings
9503  Disney My Music Story: Sukima Switch    2021  16+           16/100        1     0       16
9509  Built for Mars: The Perseverance Rover  2021  NaN           14/100        1     0       14
9502  Sharkcano                               2020  NaN           16/100        1     0       16
9507  Texas Storm Squad                       2020  13+           14/100        1     0       14
9508  What the Shark?                         2020  13+           14/100        1     0       14
9510  Most Wanted Sharks                      2020  NaN           14/100        1     0       14
9511  Doc McStuffins: The Doc Is In           2020  NaN           13/100        1     0       13
9505  Great Shark Chow Down                   2019   7+           14/100        1     0       14
9512  Ultimate Viking Sword                   2019  NaN           13/100        1     0       13
9514  Women of Impact: Changing the World     2019   7+           10/100        1     0       10
9504  Big Cat Games                           2015  NaN           15/100        1     0       15
9513  Hunt for the Abominable Snowman         2011  NaN           10/100        1     0       10
9506  In Beaver Valley                        1950  NaN           14/100        1     0       14

Regression line

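  • shows the trend: a visual representation of the relationship between x and y
  • can be used to predict values, with caution when extrapolating beyond the observed range
  • summarises the overall direction of the relationship; how tightly the points cluster around the line indicates its strength
  • helps identify outliers: points that lie far from the line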
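As a rough sketch of what sits behind those bullet points, the snippet below fits a straight line by least squares with NumPy on made-up (year, rating) pairs, uses it for a prediction, and flags the point with the largest residual. This is an illustration only, not how seaborn computes its regression internally.

```python
import numpy as np

# Made-up (year, rating) pairs; the last one is a deliberate outlier
years = np.array([2000, 2005, 2010, 2015, 2020, 2012], dtype=float)
ratings = np.array([50, 55, 60, 65, 70, 10], dtype=float)

# Degree-1 polynomial fit = ordinary least-squares straight line
slope, intercept = np.polyfit(years, ratings, deg=1)

# Prediction for a year outside the data (extrapolation, so treat with care)
predicted_2025 = slope * 2025 + intercept

# Residual = observed value minus the value on the line
residuals = ratings - (slope * years + intercept)
outlier_index = int(np.argmax(np.abs(residuals)))

print(f"slope per year: {slope:.3f}")
print(f"predicted rating in 2025: {predicted_2025:.1f}")
print(f"largest residual belongs to year {int(years[outlier_index])}")
```

The sign of the slope gives the direction of the trend, and the size of the residuals shows how well the line describes the points.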
# Using sns.lmplot
sns.lmplot(x='Year', 
                y='ratings', 
                height=6,
                aspect=1.5,
                data=disney_df,
                scatter_kws={'color': 'turquoise'},
                line_kws={'color': 'hotpink'})  # Change the color of the regression line

plt.title('Rotten Tomatoes rating for Disney movies by year')
plt.ylabel('Ratings out of 100')
plt.yticks(ticks=range(0,105,5))
plt.xticks(ticks=range(1920, 2024, 10))
plt.show()

[Figure: lmplot of Rotten Tomatoes ratings for Disney movies by year, with regression line]

  • Regression line: represents the trend of the data. The shaded band around the line is the confidence interval (95% by default), which indicates the uncertainty of the regression estimate.
  • Flat slope: the line is relatively flat, which suggests that the average Rotten Tomatoes rating for Disney movies has neither clearly improved nor declined over the decades.
  • High variability: the points are widely spread, particularly from the 1990s onwards, so Disney has released movies with both very high and very low ratings.
  • Recent years: the dense clustering of points shows that far more of the movies come from recent decades. Earlier years (1930s to 1950s) contain fewer movies, with a tendency towards lower ratings compared to later years.
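The observations above are read off the plot by eye. One hedged way to back them up with numbers is a correlation coefficient plus a per-decade summary; the sketch below uses a made-up table with the same column names as disney_df, and the 'decade' column is introduced here for illustration.

```python
import pandas as pd

# Made-up stand-in for disney_df with the same two columns
toy = pd.DataFrame({
    "Year":    [1950, 1955, 1994, 1998, 2015, 2019, 2019, 2021],
    "ratings": [60, 62, 40, 80, 20, 90, 55, 14],
})

# Pearson correlation: values near 0 correspond to a flat regression line
r = toy["Year"].corr(toy["ratings"])

# Count, mean and spread of ratings per decade
toy["decade"] = toy["Year"] // 10 * 10
by_decade = toy.groupby("decade")["ratings"].agg(["count", "mean", "std"])

print(f"Pearson r = {r:.2f}")
print(by_decade)
```

The count column shows how many movies each decade contributes, and the std column captures the variability that the scatter plot shows as vertical spread.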

Categorical Plots

These plots are used to show the distribution of data across different categories.

  • stripplot: sns.stripplot()
  • swarmplot: sns.swarmplot()
  • boxplot: sns.boxplot()
  • boxenplot (an enhanced box plot): sns.boxenplot()
  • violinplot: sns.violinplot()

catplot: A figure-level function for creating categorical plots.
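Before plotting, it can help to see what a box plot actually summarises. With the default settings (whis=1.5), the box spans the quartiles and the whiskers reach to the most extreme points within 1.5 × IQR of the box; anything further out is drawn as an individual point. A small worked example on made-up numbers:

```python
import numpy as np

# Made-up values with one extreme point
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# The box: first quartile, median, third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are drawn individually
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)             # 3.0 5.0 7.0
print(whisker_low, whisker_high)  # 1.0 8.0
print(outliers)                   # [100.]
```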

penguin_lter = pd.read_csv('/home/annie/Python/data/penguins_lter.csv')
penguin_df = penguin_lter.dropna(subset=['Sex']) #drop NaNs from Sex column
penguin_df = penguin_df.drop(columns=['Sample Number', 'Individual ID', 'Stage', 'Clutch Completion', 'Date Egg', 'Comments'])
penguin_df = penguin_df[penguin_df['Sex'] != '.']
penguin_df.head()

# penguin_df.info()

   studyName  Species                              Region  Island     Culmen Length (mm)  Culmen Depth (mm)  Flipper Length (mm)  Body Mass (g)  Sex     Delta 15 N (o/oo)  Delta 13 C (o/oo)
0  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                39.1               18.7                181.0         3750.0  MALE                  NaN                NaN
1  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                39.5               17.4                186.0         3800.0  FEMALE            8.94956          -24.69454
2  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                40.3               18.0                195.0         3250.0  FEMALE            8.36821          -25.33302
4  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                36.7               19.3                193.0         3450.0  FEMALE            8.76651          -25.32426
5  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                39.3               20.6                190.0         3650.0  MALE              8.66496          -25.29805
# penguin_lter['Clutch Completion'].unique()
# penguin_lter['studyName'].unique()
# penguin_lter['Date Egg'].unique()
# penguin_lter['Stage'].unique()
# penguin_clean[penguin_clean['Sex']=='.']
penguin_df.sample(5)

     studyName  Species                                    Region  Island  Culmen Length (mm)  Culmen Depth (mm)  Flipper Length (mm)  Body Mass (g)  Sex     Delta 15 N (o/oo)  Delta 13 C (o/oo)
309  PAL0910    Gentoo penguin (Pygoscelis papua)          Anvers  Biscoe                52.1               17.0                230.0         5550.0  MALE              8.27595          -26.11657
294  PAL0809    Gentoo penguin (Pygoscelis papua)          Anvers  Biscoe                46.4               15.0                216.0         4700.0  FEMALE            8.47938          -26.95470
 50  PAL0809    Adelie Penguin (Pygoscelis adeliae)        Anvers  Biscoe                39.6               17.7                186.0         3500.0  FEMALE            8.46616          -26.12989
157  PAL0708    Chinstrap penguin (Pygoscelis antarctica)  Anvers  Dream                 45.2               17.8                198.0         3950.0  FEMALE            8.88942          -24.49433
111  PAL0910    Adelie Penguin (Pygoscelis adeliae)        Anvers  Biscoe                45.6               20.3                191.0         4600.0  MALE              8.65466          -26.32909
penguin_df['Sex'].unique()
array(['MALE', 'FEMALE'], dtype=object)
#penguin_df

sns.boxplot(data=penguin_df, x='Species', y='Body Mass (g)', palette='mako', hue='Species')
plt.xticks(ticks=[0, 1, 2], labels=['Adelie', 'Chinstrap', 'Gentoo'], rotation=45)  # rename and rotate tick labels
plt.title('Body Mass Distribution by species')
plt.show()

[Figure: box plot of body mass by species]
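Renaming tick labels by position works here, but it quietly assumes the categories are drawn in the order Adelie, Chinstrap, Gentoo. A hedged alternative is to derive a short name from the Species text itself and plot that column instead; the sketch below uses a small made-up Series, and the name "short" is just for illustration.

```python
import pandas as pd

# A few Species values in the same format as the dataset
species = pd.Series([
    "Adelie Penguin (Pygoscelis adeliae)",
    "Chinstrap penguin (Pygoscelis antarctica)",
    "Gentoo penguin (Pygoscelis papua)",
    "Adelie Penguin (Pygoscelis adeliae)",
])

# The first word of each value is the common name
short = species.str.split().str[0]

print(short.tolist())
```

With a column like this on the DataFrame, the plots can use it for x or hue directly and the plt.xticks relabelling is no longer needed.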

sns.boxenplot(data=penguin_df, x='Sex', y='Body Mass (g)', palette='viridis', hue='Sex')
plt.xticks(rotation=45)  
plt.title('Body Mass Distribution by sex')
plt.show()

[Figure: boxen plot of body mass by sex]

# #disney_df
# sns.boxplot(x=disney_df['ratings'])
# plt.show()

# sns.boxenplot(x=disney_df['ratings'])
# plt.show()
sns.violinplot(data=penguin_df, x='Species', y='Culmen Length (mm)', hue='Sex', palette='mako')
plt.xticks(ticks=[0, 1, 2], labels=['Adelie', 'Chinstrap', 'Gentoo']) # rename ticker labels
plt.legend(loc="upper left")
plt.title('Culmen Length Distribution by species and sex')
plt.show()

[Figure: violin plot of culmen length by species and sex]

# sns.stripplot(x=disney_df['ratings'])
# plt.show()

sns.stripplot(x='Culmen Depth (mm)', y='Species', hue='Species', data=penguin_df)
plt.yticks(ticks=[0, 1, 2], labels=['Adelie', 'Chinstrap', 'Gentoo'])
plt.title('Culmen Depth Distribution by species')
plt.show()

[Figure: strip plot of culmen depth by species]

# sns.swarmplot(x=disney_df['ratings'])
# plt.show()

sns.swarmplot(x='Flipper Length (mm)',y='Species', hue='Species', data=penguin_df)
plt.yticks(ticks=[0, 1, 2], labels=['Adelie', 'Chinstrap', 'Gentoo'])
plt.title('Flipper Length Distribution by species')
plt.show()

[Figure: swarm plot of flipper length by species]

Distribution Plots

These plots show the distribution of a single variable.

  • histplot (aka histogram): sns.histplot()
  • kdeplot (Kernel Density Estimate plot): sns.kdeplot()
  • ecdfplot (Empirical Cumulative Distribution Function): sns.ecdfplot()

displot: A figure-level function for creating histograms, KDE plots and ECDF plots.

disney_df.head(2)

     Title                Year  Age  Rotten Tomatoes  Disney+  Type  ratings
270  White Fang           2018   7+           76/100        1     0       76
712  Muppets Most Wanted  2014   7+           67/100        1     0       67
penguin_df.head(2)
# penguin_df['Island'].unique()

   studyName  Species                              Region  Island     Culmen Length (mm)  Culmen Depth (mm)  Flipper Length (mm)  Body Mass (g)  Sex     Delta 15 N (o/oo)  Delta 13 C (o/oo)
0  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                39.1               18.7                181.0         3750.0  MALE                  NaN                NaN
1  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                39.5               17.4                186.0         3800.0  FEMALE            8.94956          -24.69454
sns.histplot(disney_df, x='Year', 
             kde=True, 
             color='hotpink')
plt.title('Distribution of Disney Movies between 1920-2020')
plt.show()

[Figure: histogram with KDE curve of Disney movies by year]

sns.kdeplot(disney_df, x='Year', color='hotpink')
plt.show()

[Figure: KDE plot of Disney movies by year]

sns.histplot(data=penguin_df, x='Body Mass (g)')
plt.show()

[Figure: histogram of penguin body mass]

sns.kdeplot(x=disney_df['ratings']) #'Year'
plt.show()

[Figure: KDE plot of Disney ratings]
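To get a feel for what a KDE curve is, here is a hand-rolled sketch: one Gaussian bump is placed on each observation and the bumps are averaged. The fixed bandwidth of 10 is an assumption chosen for illustration (seaborn picks its bandwidth automatically), and the ratings are a few values taken from the tables above.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

ratings = np.array([14, 16, 51, 54, 59, 67, 76], dtype=float)
grid = np.linspace(-50, 150, 2001)
density = gaussian_kde_1d(ratings, grid, bandwidth=10.0)

# A density should integrate to (approximately) 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"area under the curve: {area:.3f}")
```

Note that the smoothed curve extends below 0 and above 100 even though no rating can fall there. sns.kdeplot has cut and clip parameters that can limit the curve to the valid range.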

sns.ecdfplot(x=disney_df['ratings'])
plt.show()

[Figure: ECDF plot of Disney ratings]
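The ECDF is simple enough to compute by hand, which makes the plot easier to read: for each value it gives the proportion of observations at or below it. A minimal sketch, using the five ratings from disney_df.head():

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the proportion of data at or below each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

ratings = [76, 67, 59, 54, 51]
x, y = ecdf(ratings)

print(x)  # [51. 54. 59. 67. 76.]
print(y)  # [0.2 0.4 0.6 0.8 1. ]

# Reading the curve at 59: three of the five ratings are 59 or lower
share_at_or_below_59 = float(np.mean(np.asarray(ratings) <= 59))
```

Unlike a histogram or KDE, the ECDF needs no bin width or bandwidth, so there is nothing to tune.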

Matrix Plots

These plots are used to visualize data in matrix form.

  • heatmap: sns.heatmap()
  • clustermap (hierarchically-clustered heatmap): sns.clustermap()
adelie_matrix = penguin_df[penguin_df['Species']=='Adelie Penguin (Pygoscelis adeliae)']
adelie_matrix = adelie_matrix.drop(columns=['Species','studyName', 'Region', 'Island', 'Sex', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)'])
adelie_matrix.head()

   Culmen Length (mm)  Culmen Depth (mm)  Flipper Length (mm)  Body Mass (g)
0                39.1               18.7                181.0         3750.0
1                39.5               17.4                186.0         3800.0
2                40.3               18.0                195.0         3250.0
4                36.7               19.3                193.0         3450.0
5                39.3               20.6                190.0         3650.0
# sns.heatmap(df)
plt.figure(figsize=(8, 6))
sns.heatmap(adelie_matrix.corr(), cmap='mako_r')
plt.title('Correlation Heatmap of Adelie Penguin Measurements')
plt.show()

[Figure: correlation heatmap of Adelie penguin measurements]
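A correlation heatmap can be easier to read with the numbers printed in each cell and a colour scale fixed to the full range from -1 to 1. The sketch below shows one way to do that on randomly generated stand-in data; the column names and the 'vlag' colormap are choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data: mass depends on flipper length, noise is unrelated
rng = np.random.default_rng(0)
flipper = rng.normal(190, 6, size=50)
toy = pd.DataFrame({
    "flipper_mm": flipper,
    "mass_g": 20 * flipper + rng.normal(0, 100, size=50),
    "noise": rng.normal(0, 1, size=50),
})

# Pairwise Pearson correlations: symmetric, with 1.0 on the diagonal
corr = toy.corr()

# annot prints the value in each cell; vmin/vmax pin the colour scale
ax = sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1,
                 cmap="vlag", square=True)
ax.set_title("Correlation heatmap (toy data)")
```

A diverging colormap centred on zero makes positive and negative correlations visually distinct, which a sequential palette does not.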

Multi-Plot Grids

These are used for plotting multiple plots in a grid layout.

  • FacetGrid: sns.FacetGrid()
  • PairGrid: sns.PairGrid()
  • pairplot: sns.pairplot()
penguins = sns.load_dataset('penguins')
penguins.head()

   species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
0  Adelie   Torgersen            39.1           18.7              181.0       3750.0  Male
1  Adelie   Torgersen            39.5           17.4              186.0       3800.0  Female
2  Adelie   Torgersen            40.3           18.0              195.0       3250.0  Female
3  Adelie   Torgersen             NaN            NaN                NaN          NaN  NaN
4  Adelie   Torgersen            36.7           19.3              193.0       3450.0  Female
peng = sns.FacetGrid(penguins, col='species',  # one column of subplots per unique value in the 'species' column
                     row='sex', hue='sex', palette="mako_r", sharex=False)  # one row of subplots per sex
peng.map(sns.histplot, 'bill_length_mm')
# plt.xticks(ticks=[0, 1, 2], labels=['Adelie', 'Chinstrap', 'Gentoo'])

plt.show()

[Figure: FacetGrid of bill length histograms by species and sex]

penguin_df.head(2)

   studyName  Species                              Region  Island     Culmen Length (mm)  Culmen Depth (mm)  Flipper Length (mm)  Body Mass (g)  Sex     Delta 15 N (o/oo)  Delta 13 C (o/oo)
0  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                39.1               18.7                181.0         3750.0  MALE                  NaN                NaN
1  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                39.5               17.4                186.0         3800.0  FEMALE            8.94956          -24.69454
sns.pairplot(x_vars = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)'],
             y_vars = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)'],
             hue = 'Species',
             data = penguin_df)
plt.suptitle('Pairplot of Penguin Measurements by Species', y=1.02)
plt.show()

[Figure: pairplot of penguin measurements by species]

# sns.PairGrid()
pg = sns.PairGrid(penguin_df,
                 x_vars = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)'],
                 y_vars = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)'],
                 hue='Species',
                 palette='cubehelix')
pg.map(sns.scatterplot)
pg.add_legend()
plt.suptitle('Pair Grid showing Penguin Measurements by Species', y=1.02)

plt.show()

[Figure: PairGrid of scatter plots of penguin measurements by species]

pg2 = sns.PairGrid(penguin_df,
                 x_vars = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)'],
                 y_vars = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)'],
                 hue='Species',
                 palette='cubehelix')
pg2.map_upper(sns.scatterplot)
pg2.map_lower(sns.kdeplot)
pg2.map_diag(sns.histplot)
# pg.add_legend()
# plt.suptitle('Pair Grid showing Penguin Measurements by Species', y=1.02)

plt.show()

[Figure: PairGrid with scatter plots (upper), KDE plots (lower) and histograms (diagonal)]

Joint Plots

These combine univariate and bivariate plots to show relationships between two variables.

  • jointplot: sns.jointplot()
  • JointGrid: sns.JointGrid()
iris = sns.load_dataset('iris')
iris.sample(5)

     sepal_length  sepal_width  petal_length  petal_width  species
 36           5.5          3.5           1.3          0.2  setosa
 46           5.1          3.8           1.6          0.2  setosa
 52           6.9          3.1           4.9          1.5  versicolor
140           6.7          3.1           5.6          2.4  virginica
106           4.9          2.5           4.5          1.7  virginica
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
# sns.jointplot()

ir = sns.jointplot(data=iris, x="sepal_length", y="petal_length", 
              hue='species', 
              palette='husl')
plt.show()

[Figure: joint plot of sepal length vs petal length by species]

# sns.jointplot()

ir = sns.jointplot(data=iris, x="sepal_length", y="petal_length", kind='hex')
plt.show()

[Figure: hexbin joint plot of sepal length vs petal length]

# sns.JointGrid()

ir2 = sns.JointGrid(data=iris, x="sepal_width", y="petal_width",
              hue='species', 
              palette='husl')

ir2.plot(sns.scatterplot, sns.histplot)
plt.show()

[Figure: JointGrid of sepal width vs petal width with scatter plot and histograms]

# sns.JointGrid()

ir2 = sns.JointGrid(data=iris, x="sepal_width", y="petal_width",
              hue='species')
                   # height=8)

ir2.plot_joint(sns.scatterplot, s=100, palette='husl', marker="+")
ir2.plot_marginals(sns.histplot, kde=True, palette='mako')
plt.show()

[Figure: JointGrid of sepal width vs petal width with "+" markers and histogram/KDE marginals]

Time Series Plots

These plots are used to visualise time series data.

  • lineplot: sns.lineplot()
disney_df.head(2)

     Title                Year  Age  Rotten Tomatoes  Disney+  Type  ratings
270  White Fang           2018   7+           76/100        1     0       76
712  Muppets Most Wanted  2014   7+           67/100        1     0       67
plt.figure(figsize=(8,6))
sns.lineplot(x='Year', 
                y='ratings', 
                data=disney_df,
                color='hotpink')
plt.title('Disney ratings over time')
plt.xticks(ticks=range(1920, 2024, 10)) 
plt.show()

[Figure: line plot of Disney ratings over time]

Statistical Estimation

These plots are used to show statistical estimates of the data.

  • barplot: sns.barplot()
  • pointplot: sns.pointplot()
  • countplot: sns.countplot()
penguin_df.head(2)

   studyName  Species                              Region  Island     Culmen Length (mm)  Culmen Depth (mm)  Flipper Length (mm)  Body Mass (g)  Sex     Delta 15 N (o/oo)  Delta 13 C (o/oo)
0  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                39.1               18.7                181.0         3750.0  MALE                  NaN                NaN
1  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                39.5               17.4                186.0         3800.0  FEMALE            8.94956          -24.69454
adelie = penguin_df[penguin_df['Species']=='Adelie Penguin (Pygoscelis adeliae)']
adelie

     studyName  Species                              Region  Island     Culmen Length (mm)  Culmen Depth (mm)  Flipper Length (mm)  Body Mass (g)  Sex     Delta 15 N (o/oo)  Delta 13 C (o/oo)
  0  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                39.1               18.7                181.0         3750.0  MALE                  NaN                NaN
  1  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                39.5               17.4                186.0         3800.0  FEMALE            8.94956          -24.69454
  2  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                40.3               18.0                195.0         3250.0  FEMALE            8.36821          -25.33302
  4  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                36.7               19.3                193.0         3450.0  FEMALE            8.76651          -25.32426
  5  PAL0708    Adelie Penguin (Pygoscelis adeliae)  Anvers  Torgersen                39.3               20.6                190.0         3650.0  MALE              8.66496          -25.29805
...  ...        ...                                  ...     ...                       ...                ...                  ...            ...  ...                   ...                ...
147  PAL0910    Adelie Penguin (Pygoscelis adeliae)  Anvers  Dream                    36.6               18.4                184.0         3475.0  FEMALE            8.68744          -25.83060
148  PAL0910    Adelie Penguin (Pygoscelis adeliae)  Anvers  Dream                    36.0               17.8                195.0         3450.0  FEMALE            8.94332          -25.79189
149  PAL0910    Adelie Penguin (Pygoscelis adeliae)  Anvers  Dream                    37.8               18.1                193.0         3750.0  MALE              8.97533          -26.03495
150  PAL0910    Adelie Penguin (Pygoscelis adeliae)  Anvers  Dream                    36.0               17.1                187.0         3700.0  FEMALE            8.93465          -26.07081
151  PAL0910    Adelie Penguin (Pygoscelis adeliae)  Anvers  Dream                    41.5               18.5                201.0         4000.0  MALE              8.89640          -26.06967

146 rows × 11 columns

sns.barplot(data=penguin_df, x='Species', y='Culmen Length (mm)', color='lavender')
plt.xticks(ticks=[0, 1, 2], labels=['Adelie', 'Chinstrap', 'Gentoo'])
plt.title('Culmen Length of Penguins by species')
plt.show()

[Figure: bar plot of culmen length by species]
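The height of each bar is not a count or a total: by default barplot shows the mean of y for each category, with an error bar for a 95% confidence interval estimated by bootstrapping. The sketch below reproduces both by hand on made-up numbers; bootstrap_ci is a helper written for this example, not a seaborn function.

```python
import numpy as np
import pandas as pd

# Made-up culmen lengths for two species
toy = pd.DataFrame({
    "species": ["Adelie"] * 5 + ["Gentoo"] * 5,
    "culmen_length_mm": [38.0, 39.0, 40.0, 41.0, 42.0,
                         45.0, 46.0, 47.0, 48.0, 49.0],
})

# Bar heights: the mean per category
bar_heights = toy.groupby("species")["culmen_length_mm"].mean()

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement n_boot times and take the mean of each resample
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (100 - level) / 2
    return np.percentile(boot_means, [tail, 100 - tail])

adelie = toy.loc[toy["species"] == "Adelie", "culmen_length_mm"]
low, high = bootstrap_ci(adelie)

print(bar_heights)
print(f"Adelie 95% CI for the mean: {low:.2f} to {high:.2f}")
```

Because the interval comes from random resampling, the error bars in a seaborn bar plot can shift slightly between runs unless a seed is set.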

sns.barplot(data=penguin_df, x='Species', y='Culmen Length (mm)', hue='Sex', palette='mako')
plt.xticks(ticks=[0, 1, 2], labels=['Adelie', 'Chinstrap', 'Gentoo'])
plt.title('Culmen Length of Penguins by sex and species')
plt.show()

[Figure: bar plot of culmen length by species and sex]

titanic = sns.load_dataset('titanic')
titanic.head()

   survived  pclass  sex      age  sibsp  parch     fare  embarked  class  who    adult_male  deck  embark_town  alive  alone
0         0       3  male    22.0      1      0   7.2500  S         Third  man          True  NaN   Southampton  no     False
1         1       1  female  38.0      1      0  71.2833  C         First  woman       False  C     Cherbourg    yes    False
2         1       3  female  26.0      0      0   7.9250  S         Third  woman       False  NaN   Southampton  yes     True
3         1       1  female  35.0      1      0  53.1000  S         First  woman       False  C     Southampton  yes    False
4         0       3  male    35.0      0      0   8.0500  S         Third  man          True  NaN   Southampton  no      True
sns.barplot(data=titanic, x='pclass', y='age', palette='viridis', hue='alive')
plt.title('Passenger Age by Class and Survival Status on the Titanic')
plt.show()

[Figure: bar plot of passenger age by class and survival status on the Titanic]

# sns.pointplot()

sns.pointplot(data=penguin_df, x='Species', y='Culmen Length (mm)', hue='Sex', palette='viridis')
plt.xticks(ticks=[0, 1, 2], labels=['Adelie', 'Chinstrap', 'Gentoo'])
plt.show()

[Figure: point plot of culmen length by species and sex]

# sns.countplot()
sns.countplot(data=penguin_df, x='Species', color='turquoise')
plt.xticks(ticks=[0, 1, 2], labels=['Adelie', 'Chinstrap', 'Gentoo'])
plt.show()

[Figure: count plot of penguins by species]

penguin_df.Species.value_counts()#.sum()
Species
Adelie Penguin (Pygoscelis adeliae)          146
Gentoo penguin (Pygoscelis papua)            119
Chinstrap penguin (Pygoscelis antarctica)     68
Name: count, dtype: int64
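# Species shares (%) in the raw dataset: 152 Adelie, 124 Gentoo and 68 Chinstrap out of 344 rows,
# i.e. before the rows with a missing Sex were dropped (the cleaned penguin_df counted above has 333 rows)
(152/344)*100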
44.18604651162791
(124/344)*100
36.04651162790697
(68/344)*100
19.767441860465116
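For the cleaned data, the same percentages can be computed in one step instead of typing each division by hand. The sketch below starts from the three counts shown in the value_counts() output above, with shortened species labels for readability.

```python
import pandas as pd

# Counts from the cleaned penguin_df (see the value_counts() output above)
counts = pd.Series({"Adelie": 146, "Gentoo": 119, "Chinstrap": 68}, name="count")

# Share of each species as a percentage of all rows
shares = (counts / counts.sum() * 100).round(2)

print(counts.sum())  # 333
print(shares)

# On the DataFrame itself, value_counts(normalize=True) gives the same proportions:
# penguin_df['Species'].value_counts(normalize=True) * 100
```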