Deep Sea Coral Analysis¶

This dataset comes from NOAA and contains information specifically about deep sea corals worldwide. More technical information can be found on the Kaggle page: https://www.kaggle.com/datasets/noaa/deep-sea-corals.

Exploration¶

During the exploration of this dataset we will focus mainly on locality questions: which species are located where, both in the water column and geographically around the world?

The general steps taken will be as follows:

  • Load and clean the dataset. Remove any columns that are not needed for the analysis and remove any rows that have missing data. Convert columns to dates or numbers as needed. We will wait to do any encoding until after we have explored the dataset further.

  • Explore the data by plotting it in various ways to see what it looks like. We want to analyze what time range we are working with, which regions of the world the dataset represents, how many species are recorded, and how many records we have for each species.

  • Analyze only a subset of species. Perhaps the top 10 most common species, or the most commonly recorded species that together account for 50% of the observations.

In [1]:
import warnings

# disable future, user, and deprecation warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")
In [2]:
df = pd.read_csv('deep_sea_corals.csv')

df.info()
/tmp/ipykernel_57624/462335268.py:1: DtypeWarning: Columns (5,7,8,13) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('deep_sea_corals.csv')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 513373 entries, 0 to 513372
Data columns (total 20 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   CatalogNumber            513372 non-null  float64
 1   DataProvider             513372 non-null  object 
 2   ScientificName           513372 non-null  object 
 3   VernacularNameCategory   513197 non-null  object 
 4   TaxonRank                513364 non-null  object 
 5   Station                  253590 non-null  object 
 6   ObservationDate          513367 non-null  object 
 7   latitude                 513373 non-null  object 
 8   longitude                513373 non-null  object 
 9   DepthInMeters            513372 non-null  float64
 10  DepthMethod              496845 non-null  object 
 11  Locality                 389645 non-null  object 
 12  LocationAccuracy         484662 non-null  object 
 13  SurveyID                 306228 non-null  object 
 14  Repository               496584 non-null  object 
 15  IdentificationQualifier  488591 non-null  object 
 16  EventID                  472141 non-null  object 
 17  SamplingEquipment        485883 non-null  object 
 18  RecordType               501077 non-null  object 
 19  SampleID                 402294 non-null  object 
dtypes: float64(2), object(18)
memory usage: 78.3+ MB
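
The DtypeWarning above comes from columns that mix numeric and string values; per the info() output, columns 5, 7, 8, and 13 are Station, latitude, longitude, and SurveyID. A minimal sketch (not run here) of two ways the warning could be avoided when re-reading the file:

# option 1: let pandas scan the whole file before choosing dtypes
df = pd.read_csv('deep_sea_corals.csv', low_memory=False)

# option 2: pin the mixed-type columns to strings up front;
# latitude and longitude get converted to numeric during cleaning anyway
df = pd.read_csv(
    'deep_sea_corals.csv',
    dtype={'Station': str, 'latitude': str, 'longitude': str, 'SurveyID': str},
)
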
In [3]:
print('unique taxon identifiers: ', df.TaxonRank.unique())
print('sampling equipment utilized: ', df.SamplingEquipment.value_counts())
unique taxon identifiers:  [nan 'species' 'genus' 'phylum' 'order' 'family' 'suborder' 'subgenus'
 'subspecies' 'variety' 'class' 'forma' 'subfamily' 'subclass']
sampling equipment utilized:  SamplingEquipment
ROV                    326289
submersible             70268
trawl                   51899
towed camera            19626
longline                 9481
dredge                   2840
AUV                      2535
drop camera              1262
grab                      621
net                       504
corer                     212
SCUBA                     174
multiple gears             86
trap                       41
other                      20
hook and line              12
pot                         5
Cp                          2
Jsl-I-3905                  1
South Pacific Ocean         1
trawl-otter                 1
camera - drop               1
GMST                        1
GMT                         1
Name: count, dtype: int64

It's impressive to see the level of reliance on the ROV in comparison to the other equipment utilized. It shows just how important remote robotics have been to the study of deep sea corals. Even the use of submersible vessels is critical, as the pressures of the deep ocean are too great for humans to survive unprotected. Scuba divers really only account for a small subset of the methodologies used.

In [4]:
# filter only species type
df = df[df.TaxonRank == 'species']

# gather and rename columns. Only keep relevant columns
df = df[['ScientificName', 'ObservationDate', 'latitude', 'longitude', 'DepthInMeters', 'SamplingEquipment']]
df.columns = ['sci_name', 'date', 'lat', 'lon', 'depth_m', 'equipment']

# convert to datetimes and convert lat/lon to numeric
df.date = pd.to_datetime(df.date, format='mixed')
df.lat = pd.to_numeric(df.lat, errors='coerce')
df.lon = pd.to_numeric(df.lon, errors='coerce')

# filter nan or None
df = df.dropna()
In [5]:
print('unique species: ', df['sci_name'].nunique())
print('oldest date: ', df['date'].min())
print('newest date: ', df['date'].max())
print('northernmost point: ', df['lat'].max())
print('southernmost point: ', df['lat'].min())

dfcumsum = df.sci_name.value_counts().cumsum()

# what top X species make up 95% of the observations?
n95 = dfcumsum.searchsorted(df.shape[0] * 0.95)
print(f'top {n95} species make up 95% of the observations')

n50 = dfcumsum.searchsorted(df.shape[0] * 0.5)
print(f'top {n50} species make up 50% of the observations')
unique species:  1452
oldest date:  1868-05-04 00:00:00
newest date:  2016-03-27 00:00:00
northernmost point:  72.32
southernmost point:  -78.4
top 214 species make up 95% of the observations
top 10 species make up 50% of the observations

Even though there are almost 1,500 different species, only 214 make up 95% of the observations. This means that while some species are very rare, others are very common.

Furthermore, the dataset goes all the way back to 1868, long before the invention of the ROV and even modern scuba equipment. We will see in the visualizations later that the majority of the data is from the last few decades.

Data Visualization¶

We are now going to take a look into the dataset we have and visualize it.

Time Range¶

First, let's take a look at the time range over which this dataset was captured. We can use a histogram totaling the number of observations per year.

Top Most Common Species¶

Second, let's take a look at what the most common species are. We know from the initial analysis that the top 10 species make up almost 50% of the dataset. We will plot the top 10 species and their counts.

Geographical Location¶

Third, taking the top 10 species as a reference, we will plot the locations where they have been found around the world. We will plot the latitude and longitude of each observation to get a general idea of where the species are located and which clusters of species share the same areas.

Depth Location¶

Finally, using the depth measurements, we can build a histogram showing the distribution of the top 10 species in the water column.

In [6]:
max_year = df['date'].max().year
min_year = df['date'].min().year
bins = max_year - min_year

n95 = df['date'].quantile(0.05)  # 5th-percentile date: 95% of observations fall after it
print('95% of the observations were made after', n95.year)

# histogram of observations over date
plt.figure(figsize=(20, 5))
sns.histplot(data=df, x='date', bins=bins, color='blue')
plt.axvline(n95, color='red', linestyle='--')
plt.xlabel('Date')
plt.ylabel('Total Observations')
plt.show()
95% of the observations were made after 1985
[Figure: histogram of observations by date, with a dashed line marking the 1985 cutoff]

After 1985, the number of observations increases dramatically. This is likely due to advances in technology and the ability to explore the deep sea. And since the sampling equipment is recorded for each observation, we can also see which methodologies came into use and when.
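
As a rough, unexecuted sketch of that idea, the observations could be binned by year and split by equipment type; restricting to the five most common equipment types is just one illustrative way to keep the plot readable:

# sketch: observations per year for the most common equipment types
top_eq = df.equipment.value_counts().index[:5]
eq_by_year = (
    df[df.equipment.isin(top_eq)]
      .assign(year=df.date.dt.year)
      .groupby(['year', 'equipment'])
      .size()
      .reset_index(name='count')
)

plt.figure(figsize=(20, 5))
sns.lineplot(data=eq_by_year, x='year', y='count', hue='equipment')
plt.xlabel('Year')
plt.ylabel('Observations')
plt.show()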

In [7]:
species = df.groupby('sci_name').size().sort_values(ascending=False)

fig, axs = plt.subplots(figsize=(20, 10))
sns.barplot(x=species[:10].index, y=species[:10].values)
plt.xticks(rotation=45)
for bars in axs.containers:
    axs.bar_label(bars)
plt.title('Top 10 Species Observed')
plt.xlabel('Species')
plt.ylabel('Count')
plt.show()
[Figure: bar chart titled 'Top 10 Species Observed' with counts labeled above each bar]

The most common species has been observed roughly twice as often as the second most common species. This could either indicate a bias in the dataset or that this species has adapted to its environment much better than the other species.

In [8]:
fig, ax = plt.subplots(figsize=(10, 10))
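# note: gpd.datasets.get_path worked with the geopandas version used here, but the bundled
# sample datasets were deprecated and later removed from geopandas; with a recent release the
# Natural Earth layer would need to come from a local download or the geodatasets package instead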
countries = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
countries.plot(ax=ax, color='white', edgecolor='black')

topdf = df[df['sci_name'].isin(species[:10].index)]

sns.scatterplot(
    data=topdf,
    x='lon',
    y='lat',
    hue='sci_name',
    s=20,
    edgecolor='black',
)

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
[Figure: world map with observations of the top 10 species plotted by longitude and latitude, colored by species]

We clearly see that a lot of deep sea coral studies were conducted in the northern hemisphere along the west and east coasts of the US and Canada. There are also some hotspots around Antarctica and Hawaii. Many observations also come from the Gulf of Mexico and the island chains of Alaska.
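
A small, unexecuted sketch that would put numbers behind that impression; the latitude/longitude bands below are rough, illustrative cut-offs rather than formally defined coastal regions:

# fraction of top-10 species observations in the northern hemisphere
print('northern hemisphere share:', (topdf.lat >= 0).mean())

# rough, illustrative bands for the US/Canada west and east coasts
west = topdf[topdf.lon.between(-180, -115) & topdf.lat.between(20, 75)]
east = topdf[topdf.lon.between(-85, -50) & topdf.lat.between(20, 75)]
print('west coast share:', len(west) / len(topdf))
print('east coast share:', len(east) / len(topdf))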

In [9]:
top_species = species[:10].index

topdf = df[df['sci_name'].isin(top_species)]

fig, ax = plt.subplots(figsize=(20, 25))
fig.tight_layout(pad=3.0)

for index, spec in enumerate(list(top_species)):
    ax = plt.subplot(5, 2, index + 1)
    sdf = topdf[topdf.sci_name == spec]
    sns.histplot(data=sdf, x='depth_m', bins=20, color='blue')
    plt.xlabel('Depth (m)')
    plt.ylabel('Total Observations')
    plt.title(f'Depth of {spec}. Total Observations: {sdf.shape[0]}')
[Figure: grid of depth histograms, one panel per top-10 species, showing total observations by depth in meters]

Most of the top 10 species are found within a general depth range of their own, with the exception of Swiftia pacifica, which appears to be found in two distinct depth ranges.

Trissopathes pseudotristicha also appears to be found at much greater depths than the other species.
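
To put numbers behind those depth ranges, a small sketch (not run here) summarizing the observed depths per species:

# summary statistics of observed depth for each of the top 10 species
depth_summary = (
    topdf.groupby('sci_name')['depth_m']
         .describe()[['count', 'mean', 'min', '25%', '75%', 'max']]
         .sort_values('mean')
)
print(depth_summary)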

Hypothesis Testing¶

We can now start to ask some questions about the data and see if we can answer them. Some hypotheses we can test are:

  • Ha = The species Heteropolypus ritteri is found at depths between 200 and 2000 meters.

    Ho = The species Heteropolypus ritteri is not found at depths between 200 and 2000 meters.

  • Ha = The species Heteropolypus ritteri is found above latitude 30 degrees.

    Ho = The species Heteropolypus ritteri is not found above latitude 30 degrees.

  • Ha = The species Heteropolypus ritteri is found in the Atlantic Ocean.

    Ho = The species Heteropolypus ritteri is not found in the Atlantic Ocean.

Let's look at each of these hypotheses, using 0.05 as our significance threshold.

In [10]:
hr_df = df[df.sci_name == 'Heteropolypus ritteri']

# percentage of observations between 200 and 2000 meters
pha = hr_df[(hr_df.depth_m >= 200) & (hr_df.depth_m <= 2000)].shape[0] / hr_df.shape[0]
pho = hr_df[(hr_df.depth_m < 200) | (hr_df.depth_m > 2000)].shape[0] / hr_df.shape[0]

print('The hypothesis is: The species Heteropolypus ritteri is found at depths between 200 and 2000 meters.')

print('Percentage of observations between 200 and 2000 meters: ', pha)
print('Percentage of observations outside of 200 and 2000 meters: ', pho)

if pho < 0.05:
    print('We can reject the null hypothesis that the observations are not found at depths between 200 and 2000 meters')
else:
    print('We fail to reject the null hypothesis that the observations are not found at depths between 200 and 2000 meters')
The hypothesis is: The species Heteropolypus ritteri is found at depths between 200 and 2000 meters.
Percentage of observations between 200 and 2000 meters:  0.9682909769176964
Percentage of observations outside of 200 and 2000 meters:  0.03170902308230357
We can reject the null hypothesis that the observations are not found at depths between 200 and 2000 meters
In [11]:
hr_df = df[df.sci_name == 'Heteropolypus ritteri']

# percentage of observations above latitude 30
pha = hr_df[hr_df.lat >= 30].shape[0] / hr_df.shape[0]
pho = hr_df[hr_df.lat < 30].shape[0] / hr_df.shape[0]

print('The hypothesis is: The species Heteropolypus ritteri is found above latitude 30 degrees')

print('Percentage of observations above latitude 30: ', pha)
print('Percentage of observations below latitude 30: ', pho)

if pho < 0.05:
    print('We can reject the null hypothesis that the observations are not found above latitude 30 degrees')
else:
    print('We fail to reject the null hypothesis that the observations are not found above latitude 30 degrees')
The hypothesis is: The species Heteropolypus ritteri is found above latitude 30 degrees
Percentage of observations above latitude 30:  1.0
Percentage of observations below latitude 30:  0.0
We can reject the null hypothesis that the observations are not found above latitude 30 degrees
In [12]:
hr_df = df[df.sci_name == 'Heteropolypus ritteri']

# We are utilizing a very naive test here. This is for demonstration purposes only.
# lon between 60W and 0, lat between 50N and 50S

# percentage of observations found in Atlantic Ocean
pha = hr_df[(hr_df.lon >= -60) & (hr_df.lon <= 0) & (hr_df.lat >= -50) & (hr_df.lat <= 50)].shape[0] / hr_df.shape[0]
pho = hr_df[(hr_df.lon < -60) | (hr_df.lon > 0) | (hr_df.lat < -50) | (hr_df.lat > 50)].shape[0] / hr_df.shape[0]

print('The hypothesis is: The species Heteropolypus ritteri is found in the Atlantic Ocean')

print('Percentage of observations found in Atlantic Ocean: ', pha)
print('Percentage of observations found outside of Atlantic Ocean: ', pho)

if pho < 0.05:
    print('We can reject the null hypothesis that the observations are not found in the Atlantic Ocean')
else:
    print('We fail to reject the null hypothesis that the observations are not found in the Atlantic Ocean')
The hypothesis is: The species Heteropolypus ritteri is found in the Atlantic Ocean
Percentage of observations found in Atlantic Ocean:  0.0
Percentage of observations found outside of Atlantic Ocean:  1.0
We fail to reject the null hypothesis that the observations are not found in the Atlantic Ocean
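
The checks above compare raw observation proportions against 0.05 rather than computing an actual p-value. As a rough sketch of a more formal alternative (this swaps in a one-sided binomial test and assumes scipy is available; it is not what was run above), the first hypothesis could be framed as testing whether the true out-of-range proportion is below 5%:

from scipy.stats import binomtest

hr_df = df[df.sci_name == 'Heteropolypus ritteri']
n = hr_df.shape[0]
# number of observations falling outside the 200-2000 m range
k = hr_df[(hr_df.depth_m < 200) | (hr_df.depth_m > 2000)].shape[0]

# H0: out-of-range proportion >= 0.05, Ha: out-of-range proportion < 0.05
result = binomtest(k, n, p=0.05, alternative='less')
print('p-value:', result.pvalue)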

Final Remarks¶

This dataset is very informative given the lengths to which it was recorded. As I personally know one of the scientists working to research these reef systems on behalf of NOAA, I can say that this dataset took many years of exhausting effort to collect.

In terms of its cleanliness, very little formatting was needed to be able to utilize it. Most of the cleaning was simple filtering to keep only what was needed for the analysis.

There are so many questions that can be asked of this dataset. Some questions that come to mind are: Why are certain species so much more common than others? Is it some sort of genetic advantage? Are they more resilient to the changing environment? If we had more data on the water quality, perhaps we could know more about why these creatures thrive where others do not.

It would be critical to be able to ask why these species are so common where others are not. It could help predict changes to the ecosystem: if one species is rapidly declining, perhaps it's a sign that the ecosystem is changing and other species will follow. Or, if one species is rapidly increasing, it might be because another species is declining and the ecosystem is changing.