AutoFarm - Anaylsis of Soil Moisture Part 2¶
In continuation, we are trying to predict the soil moisture over time in between the watering times. We'd like to see if there is some pattern to the evaporation rate of the moisture, so we may be able to predict when the soil should be watered next!
One of the key factors that we were missing before was some feature that models the intensity or amount of sunlight. We could redo the experiment with somelike like a BH1750 light sensor. This sensor can actually output lumen values, which can be used to estimate the intensity of sunlight. But, we did take timelapse images of the plants themselves. I bet that we can use these images, extract the brightness from each image, and use that as a proxy for the sunlight intensity. Let's try it!
First, let's load our data and organize it a bit.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
dht11_df = pd.read_csv('auto_farm_dht11.csv')
ds18b20_df = pd.read_csv('auto_farm_ds18b20.csv')
soil_df = pd.read_csv('auto_farm_soil.csv')
watering_df = pd.read_csv('soil_water_times.csv')
dht11_df['timestamp'] = pd.to_datetime(dht11_df['timestamp'])
ds18b20_df['timestamp'] = pd.to_datetime(ds18b20_df['timestamp'])
soil_df['timestamp'] = pd.to_datetime(soil_df['timestamp'])
watering_df['timestamp'] = pd.to_datetime(watering_df['timestamp'] / 1000, unit='s')
# filter out data before the first watering
min_watering = watering_df['timestamp'].min()
dht11_df = dht11_df[dht11_df['timestamp'] > min_watering]
ds18b20_df = ds18b20_df[ds18b20_df['timestamp'] > min_watering]
soil_df = soil_df[soil_df['timestamp'] > min_watering]
# normalize the soil moisture values from 0-1 for each sensor
scaler = MinMaxScaler()
for sensor, group in soil_df.groupby('sensor'):
soil_df.loc[soil_df['sensor'] == sensor, 'soil_moisture'] = scaler.fit_transform(group[['soil_moisture']])
# normalize the temperature and humidity values from 0-1 for each sensor
for sensor, group in dht11_df.groupby('sensor'):
dht11_df.loc[dht11_df['sensor'] == sensor, 'air_temperature_2'] = scaler.fit_transform(group[['air_temperature_2']])
dht11_df.loc[dht11_df['sensor'] == sensor, 'air_humidity'] = scaler.fit_transform(group[['air_humidity']])
# normalize the temperature values from 0-1 for each sensor
for sensor, group in ds18b20_df.groupby('sensor'):
ds18b20_df.loc[ds18b20_df['sensor'] == sensor, 'air_temperature_1'] = scaler.fit_transform(group[['air_temperature_1']])
# pivot tables into columns for each sensor
soil_df = soil_df.pivot(index='timestamp', columns='sensor', values='soil_moisture')
soil_df = soil_df.rename(columns={0: 'soil_moisture_0', 1: 'soil_moisture_1', 2: 'soil_moisture_2', 3: 'soil_moisture_3'})
dht11_df = dht11_df.drop(columns=['id', 'sensor'])
dht11_df = dht11_df.rename(columns={'air_temperature_2': 'dht11_air_temp', 'air_humidity': 'dht11_air_humid'})
ds18b20_df = ds18b20_df.pivot(index='timestamp', columns='sensor', values='air_temperature_1')
ds18b20_df = ds18b20_df.rename(columns={0: 'ds18b20_air_temp_0', 1: 'ds18b20_air_temp_1'})
# merge all the data together
sdf = pd.merge(dht11_df, ds18b20_df, on=['timestamp'])
sdf = pd.merge(sdf, soil_df, on=['timestamp'])
sdf = sdf.dropna()
# cut the data up into the times that the plant was watered
cut_bins = sorted(watering_df.timestamp.values)
sdf['watering'] = pd.cut(sdf['timestamp'], cut_bins)
# for each cut, trim out the first and last 30 minutes of data
# this reduces a little bit of noise from the watering event
trimmed = []
delta = pd.Timedelta(minutes=30)
for i, group in sdf.groupby('watering', observed=False):
group = group[(group['timestamp'] > i.left + delta) & (group['timestamp'] < i.right - delta)]
trimmed.append(group)
sdf = pd.concat(trimmed)
# average all the soil moisture values for each watering event
sdf['avg_soil_moisture'] = sdf[['soil_moisture_0', 'soil_moisture_1', 'soil_moisture_2', 'soil_moisture_3']].mean(axis=1)
sdf
timestamp | dht11_air_temp | dht11_air_humid | ds18b20_air_temp_0 | ds18b20_air_temp_1 | soil_moisture_0 | soil_moisture_1 | soil_moisture_2 | soil_moisture_3 | watering | avg_soil_moisture | |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | 2020-10-10 07:30:00 | 0.566594 | 0.455660 | 0.340871 | 0.253344 | 0.518050 | 0.654989 | 0.555152 | 0.536055 | (2020-10-10 06:57:00, 2020-10-11 23:53:00] | 0.566062 |
4 | 2020-10-10 07:40:00 | 0.566594 | 0.464370 | 0.321607 | 0.240377 | 0.545316 | 0.661136 | 0.569981 | 0.544623 | (2020-10-10 06:57:00, 2020-10-11 23:53:00] | 0.580264 |
5 | 2020-10-10 07:50:00 | 0.566594 | 0.531965 | 0.307521 | 0.230267 | 0.566188 | 0.667435 | 0.584388 | 0.551945 | (2020-10-10 06:57:00, 2020-10-11 23:53:00] | 0.592489 |
6 | 2020-10-10 08:00:00 | 0.565500 | 0.544652 | 0.305148 | 0.225797 | 0.578541 | 0.673849 | 0.596488 | 0.560237 | (2020-10-10 06:57:00, 2020-10-11 23:53:00] | 0.602279 |
7 | 2020-10-10 08:10:00 | 0.562802 | 0.544652 | 0.296751 | 0.218988 | 0.588930 | 0.677521 | 0.607590 | 0.567752 | (2020-10-10 06:57:00, 2020-10-11 23:53:00] | 0.610448 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4015 | 2020-11-08 01:10:00 | 0.605499 | 0.844283 | 0.367640 | 0.297473 | 0.847433 | 0.585733 | 0.307579 | 0.531594 | (2020-11-04 20:34:00, 2020-11-08 03:33:00] | 0.568085 |
4016 | 2020-11-08 01:20:00 | 0.568497 | 0.900890 | 0.371498 | 0.301758 | 0.855966 | 0.591959 | 0.310458 | 0.533064 | (2020-11-04 20:34:00, 2020-11-08 03:33:00] | 0.572862 |
4017 | 2020-11-08 01:30:00 | 0.566594 | 0.908930 | 0.378628 | 0.305212 | 0.864768 | 0.597512 | 0.314700 | 0.538643 | (2020-11-04 20:34:00, 2020-11-08 03:33:00] | 0.578906 |
4018 | 2020-11-08 01:40:00 | 0.566594 | 0.908930 | 0.380732 | 0.305992 | 0.872090 | 0.601510 | 0.316665 | 0.541868 | (2020-11-04 20:34:00, 2020-11-08 03:33:00] | 0.583034 |
4019 | 2020-11-08 01:50:00 | 0.566594 | 0.908930 | 0.381174 | 0.306826 | 0.879349 | 0.605907 | 0.319791 | 0.544824 | (2020-11-04 20:34:00, 2020-11-08 03:33:00] | 0.587468 |
3944 rows × 11 columns
import matplotlib.pyplot as plt
import seaborn as sns
features = ['dht11_air_temp', 'dht11_air_humid', 'ds18b20_air_temp_0',
'ds18b20_air_temp_1', 'avg_soil_moisture']
sns.set(style="whitegrid")
for watering, gdf in sdf.groupby('watering', observed=False):
gdf['post_watering_mins'] = (gdf['timestamp'] - watering.left).dt.total_seconds() / 60
plt.figure(figsize=(10, 5))
for feature in features:
sns.lineplot(x=gdf['post_watering_mins'], y=gdf[feature], label=feature)
plt.title(f'Watering Event: {watering.left}')
plt.show()
break
Now we can very clearly see all the data organized in one cohesive dataframe, split and normalized by the watering times. We didn't perform the pivot in the last notebook which may have been a mistake. Many duplicates would have gone into the linear regression model. But for sanity, we will redo the linear regression to see if there is much of a change.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
features = ['dht11_air_temp', 'dht11_air_humid', 'ds18b20_air_temp_0', 'ds18b20_air_temp_1']
fig, axs = plt.subplots(4, 4, figsize=(15, 10))
axs = [item for sublist in axs for item in sublist]
fig.tight_layout(h_pad=4)
lin_df = []
index = 0
for watering, gdf in sdf.groupby('watering', observed=False):
gdf['post_watering_mins'] = (gdf['timestamp'] - watering.left).dt.total_seconds() / 60
# sort by post_watering_mins
gdf = gdf.sort_values(by='post_watering_mins')
x = gdf[['post_watering_mins']]
y = gdf['avg_soil_moisture']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
model = LinearRegression(n_jobs=-1)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
r2 = r2_score(y_test, y_pred)
# set title
axs[index].set_title(f'R2: {r2:.3f}')
lin_df.append({
'score': r2,
'intercept': model.intercept_,
'slope': model.coef_[0],
'watering_time': watering.left
})
for feature in features:
sns.lineplot(x='post_watering_mins', y=feature, data=gdf, ax=axs[index], color='green')
sns.lineplot(x='post_watering_mins', y='avg_soil_moisture', data=gdf, ax=axs[index], color='blue')
sns.lineplot(x=x_test['post_watering_mins'], y=y_pred, ax=axs[index], color='red')
axs[index].xaxis.set_major_locator(plt.MaxNLocator(3))
axs[index].set_ylabel('')
axs[index].set_xlabel('')
index += 1
lin_df = pd.DataFrame(lin_df)
lin_df
score | intercept | slope | watering_time | |
---|---|---|---|---|
0 | 0.274018 | 0.745632 | 0.000083 | 2020-10-10 06:57:00 |
1 | 0.981504 | 0.570525 | 0.000108 | 2020-10-11 23:53:00 |
2 | 0.977453 | 0.412539 | 0.000133 | 2020-10-13 20:26:00 |
3 | 0.961573 | 0.291451 | 0.000133 | 2020-10-15 18:20:00 |
4 | 0.975111 | 0.334323 | 0.000126 | 2020-10-17 18:28:00 |
5 | 0.985086 | 0.265375 | 0.000145 | 2020-10-19 20:03:00 |
6 | 0.991738 | 0.206295 | 0.000142 | 2020-10-21 22:45:00 |
7 | 0.989668 | 0.223837 | 0.000127 | 2020-10-24 02:55:00 |
8 | 0.990065 | 0.167465 | 0.000121 | 2020-10-26 06:50:00 |
9 | 0.996425 | 0.129523 | 0.000123 | 2020-10-29 06:36:00 |
10 | 0.977578 | 0.110587 | 0.000134 | 2020-10-31 19:33:00 |
11 | 0.965220 | 0.197110 | 0.000105 | 2020-11-03 17:53:00 |
12 | 0.988441 | 0.040270 | 0.000115 | 2020-11-04 20:34:00 |
# standard error of the slope
from sklearn.metrics import mean_squared_error
# remove first event as it is an outlier
df = lin_df.sort_values(by='watering_time')
df = df[1:]
y = df['slope']
median = y.median()
error = mean_squared_error(y, [median] * len(y))
plt.figure(figsize=(10, 3))
sns.lineplot(data=df, x='watering_time', y='slope')
plt.title('Slope of Regression Over Time | RMSE of Median: {:.2E}'.format(error))
plt.xlabel('Watering Time')
plt.ylabel('Slope')
plt.axhline(y=median, color='red', linestyle='--', linewidth=1)
plt.fill_between(df['watering_time'], median - error, median + error, alpha=0.2)
plt.show()
Except for the first watering cycle, which I discard as an outlier containing potentially erronous data, the relationship is very linear! The previous notebook was very incorrect as it contained many duplicated datapoints. This would have skewed the results greatly! Anaylzing the slope even further, we find that the rate of evaporation is actually extremely consistent. We would expect this to change if the experiment was run for a longer period of time as the plant grows even larger and the soil chemistry changes.
If we take the median value of the slopes from the linear regression, and use that as our new regression function, we get an incredibly low RMSE value. Incredible!
Further, we have actually shown that all other metrics (temperature and humidity) actually serve no real value in predicting the soil moisture. If we test a linear model only using these features, we see extremely bad results.
From this point, we could easily select a median value of the slope and use this as a key predictor value in when we should water the plant next. Nothing wrong with nearly 90% accuracy! But I wonder if this linear relationship can be even better predicted if we know the intensity of light. Maybe our temperature and humidity didn't flucuate enough to be useful, but the light intensity may provide a better signal.
import os
import json
import numpy as np
from tqdm import tqdm
import cv2
root_dir = '/home/jack/Mounts/DiskOne'
frame_df = []
def read_frames(resolution):
timestamps = json.load(open(os.path.join(root_dir, f'timestamps_{resolution}.json')))
capture = cv2.VideoCapture(os.path.join(root_dir, f'video_{resolution}.avi'))
capture.set(cv2.CAP_PROP_FPS, 0)
loader = tqdm(total=len(timestamps), desc=f'Reading {resolution}')
while True:
ret, frame = capture.read()
if not ret:
break
loader.update(1)
timestamp = timestamps.pop(0)
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
brightness = np.mean(frame[:, :, 2])
frame_df.append({'timestamp': timestamp, 'brightness': brightness, 'resolution': resolution})
loader.close()
capture.release()
read_frames('640x480')
read_frames('1280x720')
read_frames('1920x1080')
frame_df = pd.DataFrame(frame_df)
Reading 640x480: 100%|██████████| 4960/4960 [00:28<00:00, 174.13it/s] Reading 1280x720: 100%|██████████| 26903/26903 [03:08<00:00, 142.84it/s] Reading 1920x1080: 100%|██████████| 20553/20553 [07:06<00:00, 48.16it/s]
frame_df = pd.read_parquet('frame_brightness.parquet')
frame_df['timestamp'] = pd.to_datetime(frame_df['timestamp'], unit='ms')
frame_df = frame_df[frame_df['timestamp'] > min_watering]
frame_df = frame_df[['timestamp', 'brightness']]
# aggregate every 10 minutes
frame_df = frame_df.set_index('timestamp').resample('10min').mean().reset_index()
scaler = MinMaxScaler()
frame_df['brightness'] = scaler.fit_transform(frame_df[['brightness']])
adf = pd.merge(sdf, frame_df, on=['timestamp'], how='left')
# set NA to 0. Most missing values are at night
adf['brightness'] = adf['brightness'].fillna(0)
adf
timestamp | dht11_air_temp | dht11_air_humid | ds18b20_air_temp_0 | ds18b20_air_temp_1 | soil_moisture_0 | soil_moisture_1 | soil_moisture_2 | soil_moisture_3 | watering | avg_soil_moisture | brightness | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-10-10 07:30:00 | 0.566594 | 0.455660 | 0.340871 | 0.253344 | 0.518050 | 0.654989 | 0.555152 | 0.536055 | (2020-10-10 06:57:00, 2020-10-11 23:53:00] | 0.566062 | 0.000000 |
1 | 2020-10-10 07:40:00 | 0.566594 | 0.464370 | 0.321607 | 0.240377 | 0.545316 | 0.661136 | 0.569981 | 0.544623 | (2020-10-10 06:57:00, 2020-10-11 23:53:00] | 0.580264 | 0.000000 |
2 | 2020-10-10 07:50:00 | 0.566594 | 0.531965 | 0.307521 | 0.230267 | 0.566188 | 0.667435 | 0.584388 | 0.551945 | (2020-10-10 06:57:00, 2020-10-11 23:53:00] | 0.592489 | 0.000000 |
3 | 2020-10-10 08:00:00 | 0.565500 | 0.544652 | 0.305148 | 0.225797 | 0.578541 | 0.673849 | 0.596488 | 0.560237 | (2020-10-10 06:57:00, 2020-10-11 23:53:00] | 0.602279 | 0.000000 |
4 | 2020-10-10 08:10:00 | 0.562802 | 0.544652 | 0.296751 | 0.218988 | 0.588930 | 0.677521 | 0.607590 | 0.567752 | (2020-10-10 06:57:00, 2020-10-11 23:53:00] | 0.610448 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3939 | 2020-11-08 01:10:00 | 0.605499 | 0.844283 | 0.367640 | 0.297473 | 0.847433 | 0.585733 | 0.307579 | 0.531594 | (2020-11-04 20:34:00, 2020-11-08 03:33:00] | 0.568085 | 0.636386 |
3940 | 2020-11-08 01:20:00 | 0.568497 | 0.900890 | 0.371498 | 0.301758 | 0.855966 | 0.591959 | 0.310458 | 0.533064 | (2020-11-04 20:34:00, 2020-11-08 03:33:00] | 0.572862 | 0.632836 |
3941 | 2020-11-08 01:30:00 | 0.566594 | 0.908930 | 0.378628 | 0.305212 | 0.864768 | 0.597512 | 0.314700 | 0.538643 | (2020-11-04 20:34:00, 2020-11-08 03:33:00] | 0.578906 | 0.629950 |
3942 | 2020-11-08 01:40:00 | 0.566594 | 0.908930 | 0.380732 | 0.305992 | 0.872090 | 0.601510 | 0.316665 | 0.541868 | (2020-11-04 20:34:00, 2020-11-08 03:33:00] | 0.583034 | 0.629669 |
3943 | 2020-11-08 01:50:00 | 0.566594 | 0.908930 | 0.381174 | 0.306826 | 0.879349 | 0.605907 | 0.319791 | 0.544824 | (2020-11-04 20:34:00, 2020-11-08 03:33:00] | 0.587468 | 0.632244 |
3944 rows × 12 columns
plt.figure(figsize=(20, 5))
sns.lineplot(data=adf, x='timestamp', y='brightness')
plt.title('Brightness Over Time')
plt.xlabel('Timestamp')
plt.ylabel('Brightness')
plt.show()
And done. Another column that measures the brightness. Generally speaking, the brightness remains fairly consistent through the day, with some spikes occuring sometime in the morning. Typically, this was because the sun just peaks out from the mountain and shines directly onto the plants. This would only last a few minutes before it went to normal levels again. But we can test if the brightness has any correlation with the soil moisture levels!
def test_model(x, y):
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
r2 = r2_score(y_test, y_pred)
return r2
ldf = []
index = 0
for watering, gdf in adf.groupby('watering', observed=False):
gdf['post_watering_mins'] = (gdf['timestamp'] - watering.left).dt.total_seconds() / 60
# sort by post_watering_mins
gdf = gdf.sort_values(by='post_watering_mins')
y = gdf['avg_soil_moisture']
score1 = test_model(gdf[['post_watering_mins']], y)
score2 = test_model(gdf[['post_watering_mins', 'brightness']], y)
ldf.append({
'watering_time': watering.left,
'score1': score1,
'score2': score2
})
index += 1
ldf = pd.DataFrame(ldf)
ldf['diff'] = ldf['score2'] - ldf['score1']
ldf
watering_time | score1 | score2 | diff | |
---|---|---|---|---|
0 | 2020-10-10 06:57:00 | 0.374000 | 0.637955 | 0.263955 |
1 | 2020-10-11 23:53:00 | 0.974990 | 0.975284 | 0.000294 |
2 | 2020-10-13 20:26:00 | 0.982043 | 0.980331 | -0.001713 |
3 | 2020-10-15 18:20:00 | 0.937574 | 0.899448 | -0.038126 |
4 | 2020-10-17 18:28:00 | 0.979950 | 0.983169 | 0.003219 |
5 | 2020-10-19 20:03:00 | 0.986530 | 0.987286 | 0.000756 |
6 | 2020-10-21 22:45:00 | 0.985104 | 0.990665 | 0.005561 |
7 | 2020-10-24 02:55:00 | 0.990534 | 0.979869 | -0.010666 |
8 | 2020-10-26 06:50:00 | 0.990324 | 0.989962 | -0.000362 |
9 | 2020-10-29 06:36:00 | 0.996234 | 0.993400 | -0.002834 |
10 | 2020-10-31 19:33:00 | 0.978275 | 0.981934 | 0.003658 |
11 | 2020-11-03 17:53:00 | 0.969186 | 0.947358 | -0.021829 |
12 | 2020-11-04 20:34:00 | 0.992849 | 0.991828 | -0.001021 |
Peering into out diff
column, we see that the impact of the brightness on the soil moisture is next to nothing. In fact, most of the time it leads to a negative impact (sightly reducing the r2 score). Very interesting indeed! The soil medium is so efficent at retaining moisture, that normal fluctuations in temperature, brightness, and humidity have very little, if any, impact on soil moisture levels.
We can now rest our souls and conclude that the relationship of soil moisture is a very linear function of time! And that the rate of evaporation over the course of a month remains very consistent.
Conclusion¶
This is part two of our analysis of soil moisture. I conducted this experiment back in 2020 during COVID-19 lockdowns when I was bored and had nothing to do. In the first part, I made major errors while wrangling the data which led to very incorrect conclusions about the soil moisture. It happens! And this is all part of the learning and building process. Projects like these are a great way to learn about data analysis, machine learning, and the importance of clean data. And while many blogs never really show the mistakes that were made along the way, rest assured, they are just not documented!
As for the results, we have shown that the soil moisture, during the first month of growth of the basil plant, is a linear function of time. The rate of evaporation remains consistent, and external factors such as temperature, humidity, and brightness have little to no impact on soil moisture levels. We could use the median slope of the regression model, apply a level for which we want to maintain our soil moisture, and predict when to water them next with a fairly high degree of accuracy: around 90%.
To follow, it's worth taking a look into the crops themselves. Timelapse photography was taken of the plants during the entire growout. Given the linear relationship of soil moisture, is growth also linear? Can we train an instance segmentation model to detect the plants and then measure their growth over time? This could be an interesting next analysis!