Deep Sea Coral Analysis¶
This dataset comes from NOAA and contains information specifically about deep sea corals worldwide. More technical information can be found on the Kaggle page: https://www.kaggle.com/datasets/noaa/deep-sea-corals
Exploration¶
During the exploration of this dataset we will focus mainly on locality questions: what species are located where in the water column and geographically in the world?
The general steps taken will be as follows:
Load and clean the dataset. Remove any columns that are not needed for the analysis and remove any rows that have missing data. Format any of the columns to dates or numbers as needed. We will wait to do any encoding until after we have explored the dataset further.
Explore the data by plotting it in various ways. We want to analyze what time ranges we are working with, what geographic area the dataset covers, how many species are recorded, and how many records we have for each species.
Analyze only a subset of species. Perhaps the top 10 most common species or the top 50% of species most commonly recorded in the dataset.
import warnings
# disable future, user, and deprecation warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
df = pd.read_csv('deep_sea_corals.csv')
df.info()
/tmp/ipykernel_57624/462335268.py:1: DtypeWarning: Columns (5,7,8,13) have mixed types. Specify dtype option on import or set low_memory=False. df = pd.read_csv('deep_sea_corals.csv')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 513373 entries, 0 to 513372
Data columns (total 20 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   CatalogNumber            513372 non-null  float64
 1   DataProvider             513372 non-null  object
 2   ScientificName           513372 non-null  object
 3   VernacularNameCategory   513197 non-null  object
 4   TaxonRank                513364 non-null  object
 5   Station                  253590 non-null  object
 6   ObservationDate          513367 non-null  object
 7   latitude                 513373 non-null  object
 8   longitude                513373 non-null  object
 9   DepthInMeters            513372 non-null  float64
 10  DepthMethod              496845 non-null  object
 11  Locality                 389645 non-null  object
 12  LocationAccuracy         484662 non-null  object
 13  SurveyID                 306228 non-null  object
 14  Repository               496584 non-null  object
 15  IdentificationQualifier  488591 non-null  object
 16  EventID                  472141 non-null  object
 17  SamplingEquipment        485883 non-null  object
 18  RecordType               501077 non-null  object
 19  SampleID                 402294 non-null  object
dtypes: float64(2), object(18)
memory usage: 78.3+ MB
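The DtypeWarning above suggests either passing `dtype` or setting `low_memory=False`. A minimal sketch of the first option follows, assuming a tiny in-memory CSV as a stand-in for `deep_sea_corals.csv`. The sample rows, including the non-numeric one, are hypothetical and only illustrate the pattern.

```python
import io

import pandas as pd

# Hypothetical sample standing in for deep_sea_corals.csv.
# The first data row holds non-numeric text in the coordinate columns.
sample_csv = io.StringIO(
    "CatalogNumber,ScientificName,latitude,longitude,DepthInMeters\n"
    ",,unknown,unknown,\n"
    "1,Madrepora oculata,18.3,-66.5,412\n"
    "2,Swiftia pacifica,48.1,-125.2,350\n"
)

# Read the ambiguous columns as text so pandas does not have to guess a type.
raw = pd.read_csv(sample_csv, dtype={"latitude": str, "longitude": str})

# Convert afterwards; anything that is not a number becomes NaN instead of raising.
raw["latitude"] = pd.to_numeric(raw["latitude"], errors="coerce")
raw["longitude"] = pd.to_numeric(raw["longitude"], errors="coerce")

print(raw[["latitude", "longitude"]])
```

Reading as text and converting afterwards makes the coercion step explicit, which is the same approach the cleaning cell in this notebook takes with `pd.to_numeric`.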
print('unique taxon identifiers: ', df.TaxonRank.unique())
print('sampling equipment utilized: ', df.SamplingEquipment.value_counts())
unique taxon identifiers: [nan 'species' 'genus' 'phylum' 'order' 'family' 'suborder' 'subgenus' 'subspecies' 'variety' 'class' 'forma' 'subfamily' 'subclass']
sampling equipment utilized: SamplingEquipment
ROV                    326289
submersible             70268
trawl                   51899
towed camera            19626
longline                 9481
dredge                   2840
AUV                      2535
drop camera              1262
grab                      621
net                       504
corer                     212
SCUBA                     174
multiple gears             86
trap                       41
other                      20
hook and line              12
pot                         5
Cp                          2
Jsl-I-3905                  1
South Pacific Ocean         1
trawl-otter                 1
camera - drop               1
GMST                        1
GMT                         1
Name: count, dtype: int64
It is striking how heavily the records rely on ROVs compared with the other equipment. It shows just how important remote robotics have been to the study of deep sea corals. Submersible vessels in general are critical, as the pressures of the deep ocean are too great for human divers to survive. SCUBA accounts for only a very small share of the records (174).
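To put numbers on that reliance, the counts printed above can be turned into shares. This is a small sketch that copies a few of the counts by hand and uses the 485883 non-null `SamplingEquipment` entries reported by `df.info()` as the denominator. Note that these figures are taken before any filtering to species-level records.

```python
# Counts copied from the SamplingEquipment output above (selected categories only).
equipment_counts = {
    "ROV": 326289,
    "submersible": 70268,
    "trawl": 51899,
    "towed camera": 19626,
    "SCUBA": 174,
}

# Non-null SamplingEquipment entries reported by df.info().
total_records = 485883

# Share of all records with a known sampling method, per category.
shares = {name: count / total_records for name, count in equipment_counts.items()}

for name, share in shares.items():
    print(f"{name:<14}{share:8.2%}")
```

ROVs account for roughly two thirds of the records that name a sampling method.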
# filter only species type
df = df[df.TaxonRank == 'species']
# gather and rename columns. Only keep relevant columns
df = df[['ScientificName', 'ObservationDate', 'latitude', 'longitude', 'DepthInMeters', 'SamplingEquipment']].copy()  # copy so the assignments below do not act on a slice
df.columns = ['sci_name', 'date', 'lat', 'lon', 'depth_m', 'equipment']
# convert to datetimes and convert lat/lon to numeric
df.date = pd.to_datetime(df.date, format='mixed')
df.lat = pd.to_numeric(df.lat, errors='coerce')
df.lon = pd.to_numeric(df.lon, errors='coerce')
# filter nan or None
df = df.dropna()
print('unique species: ', df['sci_name'].nunique())
# min() is the earliest (oldest) date, max() is the latest (newest)
print('oldest date: ', df['date'].min())
print('newest date: ', df['date'].max())
print('northernmost point: ', df['lat'].max())
print('southernmost point: ', df['lat'].min())
dfcumsum = df.sci_name.value_counts().cumsum()
# what top X species make up 95% of the observations?
n95 = dfcumsum.searchsorted(df.shape[0] * 0.95)
print(f'top {n95} species make up 95% of the observations')
n50 = dfcumsum.searchsorted(df.shape[0] * 0.5)
print(f'top {n50} species make up 50% of the observations')
unique species: 1452
oldest date: 1868-05-04 00:00:00
newest date: 2016-03-27 00:00:00
northernmost point: 72.32
southernmost point: -78.4
top 214 species make up 95% of the observations
top 10 species make up 50% of the observations
Even though there are almost 1,500 different species, only 214 make up 95% of the observations, and just 10 make up about half. A handful of species are very common, while most are rare.
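The cumulative-count lookup used above can be checked on a toy example. The sketch below uses ten hypothetical observations and a hypothetical helper, `species_needed`. One detail worth noting: `searchsorted` returns a zero-based position, so this sketch adds one to get a species count, whereas the figures printed above come from the raw position.

```python
import pandas as pd

# Hypothetical toy observations: "a" is common, "c" and "d" are rare.
names = pd.Series(["a"] * 5 + ["b"] * 3 + ["c", "d"])


def species_needed(names: pd.Series, fraction: float) -> int:
    """Number of most-common species needed to cover `fraction` of the rows."""
    # Running total of observations, most common species first: 5, 8, 9, 10.
    cumulative = names.value_counts().cumsum()
    # Zero-based position of the first running total that reaches the target.
    position = cumulative.searchsorted(len(names) * fraction)
    return int(position) + 1


for fraction in (0.5, 0.8, 0.95):
    print(fraction, species_needed(names, fraction))
```

Here species "a" alone covers half of the rows, "a" and "b" together cover 80%, and all four species are needed to reach 95%.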
Furthermore, the dataset goes all the way back to 1868, long before the invention of the ROV and even modern scuba equipment. We will see in the visualizations below that the majority of the data is from the last few decades.
Data Visualization¶
We are now going to take a look into the dataset we have and visualize it.
Time Range¶
First, let's take a look at the time range over which this dataset was captured. We can use a histogram totaling the number of observations per year.
Top Most Common Species¶
Second, let's take a look at the most common species. We know from the initial analysis that the top 10 species make up about 50% of the dataset. We will plot the top 10 species and their counts.
Geographical Location¶
Third, taking the top 10 species as reference, we will plot the locations that they have been found in the world. We will plot the latitude and longitude of each observation to get a general idea of where the species are located and potentially what clusters of species are located in the same area.
Depth Location¶
Finally, using the depth measurements, we can build a histogram showing the distribution of the top 10 species in the water column.
max_year = df['date'].max().year
min_year = df['date'].min().year
bins = max_year - min_year
# date of the 5th percentile: 95% of the observations fall after it
cutoff = df['date'].quantile(0.05)
print('95% of the observations were made after', cutoff.year)
# histogram of observations over date
plt.figure(figsize=(20, 5))
sns.histplot(data=df, x='date', bins=bins, color='blue')
plt.axvline(cutoff, color='red', linestyle='--')
plt.xlabel('Date')
plt.ylabel('Total Observations')
plt.show()
95% of the observations were made after 1985
After 1985, the number of observations increases dramatically. This is likely due to advances in technology and the growing ability to explore the deep sea. And because the dataset records the sampling equipment for each observation, we can also see which methodologies come into use and when.
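One way to look at when each methodology comes into use is to take the earliest observation year per equipment type. The sketch below assumes the cleaned column names `date` and `equipment` from earlier in this notebook. The rows themselves are hypothetical placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical observations; dates and pairings are made up for illustration.
obs = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["1900-01-01", "1950-01-01", "1980-01-01", "2000-01-01", "2010-01-01"]
        ),
        "equipment": ["dredge", "trawl", "submersible", "ROV", "ROV"],
    }
)

# Earliest observation per equipment type, reduced to the year and sorted in time.
first_seen = obs.groupby("equipment")["date"].min().dt.year.sort_values()
print(first_seen)
```

Applied to the real `df`, the same two lines would give a rough timeline of when each sampling method first appears in the records.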
species = df.groupby('sci_name').size().sort_values(ascending=False)
fig, axs = plt.subplots(figsize=(20, 10))
sns.barplot(x=species[:10].index, y=species[:10].values)
plt.xticks(rotation=45)
for bars in axs.containers:
    axs.bar_label(bars)
plt.title('Top 10 Species Observed')
plt.xlabel('Species')
plt.ylabel('Count')
plt.show()
The most common species has been observed about twice as often as the second most common species. This could indicate either a bias in the dataset or that this species has adapted to its environment much better than the other species.
fig, ax = plt.subplots(figsize=(10, 10))
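# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0. On newer versions,
# download the Natural Earth countries file and pass its local path to gpd.read_file instead.
countries = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
countries.plot(ax=ax, color='white', edgecolor='black')
topdf = df[df['sci_name'].isin(species[:10].index)]
# draw the observations on the same axes as the country outlines
sns.scatterplot(
    data=topdf,
    x='lon',
    y='lat',
    hue='sci_name',
    s=20,
    edgecolor='black',
    ax=ax,
)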
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
We clearly see that many deep sea coral studies were conducted in the northern hemisphere along the west and east coasts of the US and Canada, although there are some hotspots in Antarctica and Hawaii. Many studies were also conducted in the Gulf of Mexico and along the island chains of Alaska.
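The hotspots visible on the map can also be counted rather than eyeballed by binning coordinates into grid cells. The following is a rough sketch under stated assumptions: the 10-degree cell size is an arbitrary choice, and the points are hypothetical coordinates rather than rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical observation coordinates (three near the US west coast,
# one near Hawaii, one near the Antarctic Peninsula).
pts = pd.DataFrame(
    {
        "lat": [48.1, 47.3, 44.9, 21.3, -62.0],
        "lon": [-125.2, -124.8, -125.5, -157.9, -58.5],
    }
)

cell = 10  # grid cell size in degrees (assumed for illustration)

# Snap each point to the south-west corner of its grid cell.
pts["lat_bin"] = np.floor(pts["lat"] / cell) * cell
pts["lon_bin"] = np.floor(pts["lon"] / cell) * cell

# Count observations per cell, busiest cells first.
hotspots = pts.groupby(["lat_bin", "lon_bin"]).size().sort_values(ascending=False)
print(hotspots)
```

With the real data, `lat` and `lon` from `topdf` could be binned the same way to rank regions by number of observations.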
top_species = species[:10].index
topdf = df[df['sci_name'].isin(top_species)]
# one panel per species on a 5 x 2 grid
fig, axs = plt.subplots(5, 2, figsize=(20, 25))
for ax, spec in zip(axs.flat, top_species):
    sdf = topdf[topdf.sci_name == spec]
    sns.histplot(data=sdf, x='depth_m', bins=20, color='blue', ax=ax)
    ax.set_xlabel('Depth (m)')
    ax.set_ylabel('Total Observations')
    ax.set_title(f'Depth of {spec}. Total Observations: {sdf.shape[0]}')
fig.tight_layout(pad=3.0)
plt.show()
Each of the top 10 species is mostly found within its own fairly narrow depth range. The exception is Swiftia pacifica, which appears to be found in two distinct depth ranges.
Trissopathes pseudotristicha also appears to be found at much greater depths than the other species.
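The depth ranges read off the histograms can be summarized numerically with a per-species aggregation. The sketch below assumes the cleaned column names `sci_name` and `depth_m`. The species labels and depths are hypothetical and chosen so that one species has two separate depth clusters.

```python
import pandas as pd

# Hypothetical depths: "species A" has a shallow and a deep cluster,
# "species B" sits in a single deep band.
depths = pd.DataFrame(
    {
        "sci_name": ["species A"] * 4 + ["species B"] * 4,
        "depth_m": [100, 120, 900, 950, 3000, 3100, 3200, 3300],
    }
)

# Minimum, median, maximum depth and record count per species.
summary = depths.groupby("sci_name")["depth_m"].agg(["min", "median", "max", "count"])
print(summary)
```

A wide gap between minimum and maximum relative to the other species, as for "species A" here, is a hint to look at the histogram for more than one depth band.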
Hypothesis Testing¶
We can now start to ask some questions about the data and see if we can answer them. Some hypotheses we can test are:
Ha = The species Heteropolypus ritteri is found at depths between 200 and 2000 meters.
Ho = The species Heteropolypus ritteri is not found at depths between 200 and 2000 meters.
Ha = The species Heteropolypus ritteri is found above latitude 30 degrees.
Ho = The species Heteropolypus ritteri is not found above latitude 30 degrees.
Ha = The species Heteropolypus ritteri is found in the Atlantic Ocean.
Ho = The species Heteropolypus ritteri is not found in the Atlantic Ocean.
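Let's look at each of these hypotheses using a threshold of 0.05. Note that the checks in this section compare the fraction of observations that contradict Ha with this threshold; this is a simple proportion check rather than a formally computed p-value.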
hr_df = df[df.sci_name == 'Heteropolypus ritteri']
# fraction of observations between 200 and 2000 meters
pha = hr_df[(hr_df.depth_m >= 200) & (hr_df.depth_m <= 2000)].shape[0] / hr_df.shape[0]
pho = hr_df[(hr_df.depth_m < 200) | (hr_df.depth_m > 2000)].shape[0] / hr_df.shape[0]
print('The hypothesis is: The species Heteropolypus ritteri is found at depths between 200 and 2000 meters.')
print('Fraction of observations between 200 and 2000 meters: ', pha)
print('Fraction of observations outside of 200 and 2000 meters: ', pho)
if pho < 0.05:
    print('We can reject the null hypothesis that the observations are not found at depths between 200 and 2000 meters')
else:
    print('We fail to reject the null hypothesis that the observations are not found at depths between 200 and 2000 meters')
The hypothesis is: The species Heteropolypus ritteri is found at depths between 200 and 2000 meters.
Fraction of observations between 200 and 2000 meters: 0.9682909769176964
Fraction of observations outside of 200 and 2000 meters: 0.03170902308230357
We can reject the null hypothesis that the observations are not found at depths between 200 and 2000 meters
hr_df = df[df.sci_name == 'Heteropolypus ritteri']
# fraction of observations above latitude 30
pha = hr_df[hr_df.lat >= 30].shape[0] / hr_df.shape[0]
pho = hr_df[hr_df.lat < 30].shape[0] / hr_df.shape[0]
print('The hypothesis is: The species Heteropolypus ritteri is found above latitude 30 degrees')
print('Fraction of observations above latitude 30: ', pha)
print('Fraction of observations below latitude 30: ', pho)
if pho < 0.05:
    print('We can reject the null hypothesis that the observations are not found above latitude 30 degrees')
else:
    print('We fail to reject the null hypothesis that the observations are not found above latitude 30 degrees')
The hypothesis is: The species Heteropolypus ritteri is found above latitude 30 degrees
Fraction of observations above latitude 30: 1.0
Fraction of observations below latitude 30: 0.0
We can reject the null hypothesis that the observations are not found above latitude 30 degrees
hr_df = df[df.sci_name == 'Heteropolypus ritteri']
# We are using a very naive test here. This is for demonstration purposes only.
# The Atlantic is approximated by a box: lon between 60W and 0, lat between 50S and 50N
# fraction of observations found in the Atlantic Ocean
pha = hr_df[(hr_df.lon >= -60) & (hr_df.lon <= 0) & (hr_df.lat >= -50) & (hr_df.lat <= 50)].shape[0] / hr_df.shape[0]
pho = hr_df[(hr_df.lon < -60) | (hr_df.lon > 0) | (hr_df.lat < -50) | (hr_df.lat > 50)].shape[0] / hr_df.shape[0]
print('The hypothesis is: The species Heteropolypus ritteri is found in the Atlantic Ocean')
print('Fraction of observations found in Atlantic Ocean: ', pha)
print('Fraction of observations found outside of Atlantic Ocean: ', pho)
if pho < 0.05:
    print('We can reject the null hypothesis that the observations are not found in the Atlantic Ocean')
else:
    print('We fail to reject the null hypothesis that the observations are not found in the Atlantic Ocean')
The hypothesis is: The species Heteropolypus ritteri is found in the Atlantic Ocean
Fraction of observations found in Atlantic Ocean: 0.0
Fraction of observations found outside of Atlantic Ocean: 1.0
We fail to reject the null hypothesis that the observations are not found in the Atlantic Ocean
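The checks above compare an observed fraction directly with 0.05. If a formal p-value is wanted, one possible alternative is an exact one-sided binomial test, sketched here with the standard library only. The helper `binom_upper_tail`, the counts, and the reference proportion of 0.8 are all hypothetical, and the test treats observations as independent, which records from the same survey often are not.

```python
from math import comb


def binom_upper_tail(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# Hypothetical counts: 20 observations, all 20 inside the range of interest.
n_obs = 20
n_inside = 20

# H0: at most 80% of the population lies inside the range (assumed reference value).
# The p-value is the chance of seeing n_inside or more successes if H0 held exactly.
p_value = binom_upper_tail(n_inside, n_obs, 0.8)

alpha = 0.05
print(f"p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Unlike the fraction check, this p-value depends on the sample size: the same observed proportion gives weaker evidence when it comes from only a handful of records.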
Final Remarks¶
This dataset is very informative given the long period over which it was recorded. As I personally know one of the scientists working to research the reef system on behalf of NOAA, I can say that this dataset took many years of exhausting effort to collect.
In terms of its cleanliness, very little formatting was needed to be able to use it. Most of the cleaning was simple filtering to gather only what was needed for the analysis.
There are so many questions that can be asked of this dataset. Some that come to mind are: Why are certain species so much more common than others? Is it some sort of genetic advantage? Are they more resilient to the changing environment? If we had more data on the water quality, perhaps we could know more about why these creatures thrive where others do not.
It would be valuable to understand why these species are so common where others are not. It could help predict changes to the ecosystem: if one species is rapidly declining, perhaps it is a sign that the ecosystem is changing and other species will follow. Or, if one species is rapidly increasing, it might be because another species is declining and the ecosystem is shifting.