Spectrogram Generation with GStreamer and TorchAudio¶
It's been nearly a year since I last posted a project. So much has happened in the world of AI and machine learning, and I've been absolutely consumed with learning and experimenting with these new technologies.
Specifically, I've been working on processing high frequency audio data. This data, captured by underwater hydrophones, extends to frequencies beyond the range of human hearing. Marine mammals like dolphins and porpoises use these high frequency sounds for echolocation and communication.
Standard audio models like Wav2Vec2 are not designed to handle these high frequency sounds: they are trained on human speech recordings sampled at 16kHz. To effectively process these high frequency sounds, and get a model to actually understand them, we can turn to generating spectrograms.
Spectrograms are visual representations of audio signals. They allow us to visualize how frequencies change over time, even when those frequencies are beyond human hearing. Using the Fast Fourier Transform (FFT), frequency is encoded on the y-axis, time on the x-axis, and amplitude is represented by color intensity.
In essence, a single spectrogram image contains temporal information. Because it is a static image, we can use it to train classical convolutional neural networks (CNNs) to classify and analyze these high frequency sounds. We could also combine CNNs with LSTMs to capture both spatial and temporal features from the spectrograms, or even use transformer architectures to capture long-range dependencies in the spectrogram data.
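To make that concrete before touching the live stream, here's a minimal sketch (a synthetic chirp rather than real hydrophone audio) of turning a waveform into exactly that kind of image with torchaudio, which we'll lean on later:
import torch
import torchaudio
sample_rate = 48000  # Hz, the same rate we'll see from the hydrophone stream later
duration = 2.0       # seconds of synthetic audio
# synthetic chirp sweeping from 1 kHz up to 20 kHz over the clip
freqs = torch.linspace(1_000, 20_000, int(sample_rate * duration))
phase = 2 * torch.pi * torch.cumsum(freqs, dim=0) / sample_rate
waveform = torch.sin(phase)
# STFT-based spectrogram: rows are frequency bins, columns are time frames
spec = torchaudio.transforms.Spectrogram(n_fft=1024, hop_length=512, power=2.0)(waveform)
spec_db = torchaudio.transforms.AmplitudeToDB("power")(spec)
# (n_fft // 2 + 1) frequency bins tall, one column per 512-sample hop
print(spec_db.shape)  # torch.Size([513, 188])
Each of the modeling approaches above would take an image like this (or a batch of them) as input.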
All of these approaches are really cool, but what if we have a live audio stream? Wouldn't it be cool if we could generate a live stream (both audio and video) that shows the spectrogram in real-time as the audio is being captured? Would this help computer vision models have an easier time classifying the audio data?
Sometimes, just sometimes, we have to do a little bit of classical software engineering to make things happen.
The Source¶
There is a really cool project called OrcaSound that captures live audio from hydrophones placed in the ocean. They have a live stream that you can listen to here. Go check it out, seriously, it's super cool.
import cv2
import matplotlib.pyplot as plt
image = cv2.imread("orcasound.png")
plt.figure(figsize=(10, 4))
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()
In fact, they actually have a short talk about the project and the machine learning they have already done with the data here.
Like a little kid playing in the browser, I inspected the network traffic and found the actual HLS playlist URL. I can then probe it with ffprobe to see the audio format.
! ffprobe -v quiet -print_format json -show_format -show_streams https://audio-orcasound-net.s3.amazonaws.com/rpi_orcasound_lab/hls/1763452821/live.m3u8 | jq '{ codec: .streams[0].codec_name, sample_rate: .streams[0].sample_rate, channels: .streams[0].channels }'
{ "codec": "aac", "sample_rate": "48000", "channels": 2 }
oohhhh so cool. Now we are cooking! From the ffprobe output, I can see that the audio stream is encoded as AAC with a sample rate of 48kHz and 2 channels (stereo). AAC is a really awesome lossy audio codec, supported out of the box by ffmpeg, that delivers good quality audio at low bitrates while still covering a wide frequency range! This is critical for capturing the range of frequencies used by marine mammals in their vocalizations. As stated before, normal human speech models are not designed to handle these high frequency sounds, so having a codec that preserves these frequencies is essential for effective analysis and classification of the audio data.
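As a quick sanity check on why that 48kHz matters: the highest frequency a sampled signal can represent is half its sample rate (the Nyquist limit), so this stream carries three times the bandwidth of a 16kHz speech corpus:
speech_model_rate = 16_000  # Hz, the rate Wav2Vec2-style speech models are trained at
hydrophone_rate = 48_000    # Hz, what ffprobe reported for the OrcaSound stream
# Nyquist: usable bandwidth is half the sample rate
print(speech_model_rate // 2)  # 8000 Hz ceiling for speech models
print(hydrophone_rate // 2)    # 24000 Hz ceiling for this stream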
GStreamer¶
GStreamer is so cool. I have fallen in love with it. For some time, I actually really didn't like it, but that was before I understood how to use it effectively. And with bindings in Python (and Rust, my new favorite language), it makes it super easy to build complex multimedia pipelines.
What we essentially need to do is build a live streaming pipeline that captures the audio from the HLS stream, decodes it, and hands the audio samples to Python, where we generate spectrograms at a fixed interval and then encode the spectrogram images into a video stream that can be output somewhere! That somewhere could be an MP4 file, an RTSP stream, or even a local display window. For now, we will output to a video file so I can embed it in this notebook.
Let's start with utilizing the gst-launch-1.0 command line tool to prototype our pipeline.
gst-launch-1.0 -v souphttpsrc location="https://audio-orcasound-net.s3.amazonaws.com/rpi_orcasound_lab/hls/1763452821/live.m3u8" is-live=1 ! hlsdemux ! queue ! decodebin ! audioconvert ! autoaudiosink
Running this in the command line, indeed, we do hear the live audio stream from the OrcaSound hydrophone! Success! Now let's start scripting this in Python so we can capture the audio samples and generate spectrograms. I've added comments to the code to explain what each part does.
import numpy as np
import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst, GLib
Gst.init(None)
hls_source = "https://audio-orcasound-net.s3.amazonaws.com/rpi_orcasound_lab/hls/1763452821/live.m3u8"
source_pipeline = f'''
souphttpsrc location="{hls_source}" is-live=true \
! hlsdemux \
! queue \
! decodebin \
! audioconvert \
! audio/x-raw,channels=2 \
! deinterleave name=d \
d.src_0 ! \
audiobuffersplit output-buffer-duration=1/15 ! \
appsink name=read emit-signals=true
'''.strip()
loop = GLib.MainLoop.new(None, False)
source_pipeline = Gst.parse_launch(source_pipeline)
audio_source = source_pipeline.get_by_name("read")
def on_audio_sample(sink):
sample = sink.emit("pull-sample")
audio_buffer = sample.get_buffer()
caps = sample.get_caps()
success, map_info = audio_buffer.map(Gst.MapFlags.READ)
if not success:
return Gst.FlowReturn.ERROR
data = map_info.data
# tensor is a 1D numpy array of float32 audio samples
tensor = np.frombuffer(data, dtype=np.float32)
# TODO: Process the audio data (e.g., generate spectrogram)
print(f"caps: {caps.to_string()}")
print(f"tensor: {tensor.shape} {tensor.dtype} {tensor.min()} {tensor.max()}")
audio_buffer.unmap(map_info)
return Gst.FlowReturn.OK
def on_source_message(bus, message):
if message.type == Gst.MessageType.EOS:
print("End-Of-Stream reached.")
loop.quit()
elif message.type == Gst.MessageType.ERROR:
print(f"GStreamer Source Error: {message.parse_error()}")
loop.quit()
elif message.type == Gst.MessageType.WARNING:
print(f"GStreamer Source Warning: {message.parse_warning()}")
# setup a bus to be able to catch errors and gstreamer messages.
# this will come in handy later for debugging and also triggering
# end-of-stream events.
source_bus = source_pipeline.get_bus()
source_bus.add_signal_watch()
source_bus.connect("message", on_source_message)
# this is where the audio data will be received and we can handle it.
audio_source.connect("new-sample", lambda sink: on_audio_sample(sink))
# it's a live source, so set the pipeline to playing right away
source_pipeline.set_state(Gst.State.PLAYING)
try:
loop.run()
except KeyboardInterrupt:
print("Interrupted by user, stopping...")
except Exception:
print("Failed during processing")
finally:
source_pipeline.set_state(Gst.State.NULL)
Interrupted by user, stopping...
Now I couldn't get this to run in a notebook unfortunately. But that's okay. The output was a little something like this:
caps: audio/x-raw, format=(string)F32LE, layout=(string)interleaved, rate=(int)48000, channels=(int)1
tensor: (3200,) float32 -0.009153034538030624 0.008057080209255219
A few very important notes on the GStreamer application so far. For starters, we need to understand the caps information a bit more. The caps tell us that the audio format is F32LE, which means 32-bit floating point, little-endian. This is great because it means we can directly convert the audio samples into a numpy array of float32 values. The caps also show a single channel, even though ffprobe told us the original stream is stereo (2 channels): that's because we added the deinterleave element to the pipeline to capture only one channel. I'm not exactly sure why the hydrophone would be in stereo, unless it's simply to give the listener a sense of spatial audio, so for now I'm assuming we can work with a single channel (mono) audio stream for spectrogram generation.
The second important note is the shape of the tensor. souphttpsrc and hlsdemux stream the audio data in chunks whose size is determined by the internal buffering of the GStreamer elements, and that doesn't necessarily correspond to the framerate we want for the resulting video. This is where the audiobuffersplit element comes in: it re-chunks the buffers into fixed durations that we can match to a given framerate. With output-buffer-duration set to 1/15, each buffer covers 1/15 of a second, which gives us 15 frames per second in the resulting video.
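As a sanity check on the numbers, the 3200-sample tensors we saw in the output fall straight out of that setting:
sample_rate = 48000              # Hz, from the caps
output_buffer_duration = 1 / 15  # seconds, as configured on audiobuffersplit
samples_per_buffer = int(sample_rate * output_buffer_duration)
print(samples_per_buffer)  # 3200, matching the (3200,) tensors printed above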
So now we have a live audio stream being captured in Python as numpy arrays of float32 audio samples at a fixed interval. Now we can start to generate the spectrograms. Generating the spectrograms, though, is going to be a bit tricky. A few things we need to consider:
- The sample rate of the audio stream is 48kHz. This means that we need to set the FFT parameters accordingly to capture enough resolution to see the high frequency sounds.
- The length of the audio captured in each chunk determines how much of the spectrogram we can draw at once. If we want to capture more temporal context, we need a way to buffer the spectrogram images over time.
- We need to convert the spectrogram into a format that can be encoded into a video stream. This means normalizing the data and converting it into an 8-bit image format.
- We need to consider the color mapping of the spectrogram. Typically, spectrograms are visualized using a colormap that maps amplitude values to colors. For simplicity, we can start with a grayscale representation where higher amplitudes are lighter and lower amplitudes are darker.
- For a standardized output, we should resize the spectrogram image to some standard resolution, like 640x480 pixels. But to maintain the aspect ratio to avoid distortion, we need to resize it properly and then crop it or pad it to fit the desired dimensions.
With these things considered, here is a bit of code to achieve this:
import torch as t
import torchaudio as ta
spec_transform = ta.transforms.Spectrogram(
n_fft=1024,
win_length=1024,
hop_length=512,
normalized=True,
power=2.0
)
power_to_db = ta.transforms.AmplitudeToDB("power", 150.0)
# inside the on_audio_sample function after converting to numpy array
tensor = t.from_numpy(tensor).float()
tensor = power_to_db(spec_transform(tensor))
# we shift and scale the dB values into the range 0-255 for image encoding
tensor += 100.0
tensor = (tensor / 150.0) * 255.0
# clamp before the uint8 cast so out-of-range values don't wrap around
tensor = tensor.clamp(0.0, 255.0)
tensor = tensor.numpy().astype(np.uint8)
# flip the spectrogram vertically so low frequencies are at the bottom
tensor = np.flipud(tensor)
I'm utilizing torchaudio to generate the spectrograms. The hyperparameters were chosen somewhat arbitrarily; we can adjust them later if needed. However, this only gets us a sliver of the total spectrogram that we want to generate. We need to decide how long to buffer the spectrogram for, set that as a hyperparameter, and then calculate the size of the resulting spectrogram image.
spectrogram_duration = 5.0 # seconds
sample_rate = 48000 # Hz
n_fft = 1024
hop_length = 512
spectrogram_image_height = n_fft // 2 + 1
spectrogram_image_width = int((spectrogram_duration * sample_rate) / hop_length)
spectrogram_image = np.zeros((spectrogram_image_height, spectrogram_image_width), dtype=np.uint8)
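Plugging the numbers in, that buffer comes out to 513 frequency bins (DC up through the Nyquist bin) by 468 hop-sized columns across the 5 second window:
# 1024 // 2 + 1 = 513 bins, int(5.0 * 48000 / 512) = 468 columns
assert spectrogram_image_height == 513
assert spectrogram_image_width == 468
print(spectrogram_image.shape)  # (513, 468)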
This definitely assumes we know the sample_rate ahead of time! If we want to be more dynamic (which we can), we can add a probe to the appsink's pad to capture the caps event and extract the sample rate from there.
def on_audio_probe(pad, info, user_data):
event = info.get_event()
if event.type == Gst.EventType.CAPS:
caps = event.parse_caps()
print(f"Audio caps: {caps.to_string()}")
# once we have the caps, we can remove this probe.
return Gst.PadProbeReturn.REMOVE
return Gst.PadProbeReturn.PASS
audio_source\
.get_static_pad("sink")\
.add_probe(Gst.PadProbeType.EVENT_DOWNSTREAM, on_audio_probe, None)
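The probe above only prints the caps. To actually pull the sample rate out and size the rolling buffer from it, something along these lines should work; this is a sketch I haven't folded back into the full script:
def on_audio_probe(pad, info, user_data):
    event = info.get_event()
    if event.type == Gst.EventType.CAPS:
        # caps hold a list of structures; the first one carries the audio fields
        structure = event.parse_caps().get_structure(0)
        ok, rate = structure.get_int("rate")
        if ok:
            # recompute the rolling buffer width from the real sample rate
            width = int(spectrogram_duration * rate / hop_length)
            print(f"sample rate {rate} Hz -> spectrogram width {width}")
        return Gst.PadProbeReturn.REMOVE
    return Gst.PadProbeReturn.PASS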
Now that we have our buffer, we can start filling it in with the spectrogram slices we generate from each audio chunk.
# inside the on_audio_sample function after generating the spectrogram slice
# assuming spectrogram_image is a global variable
spectrogram_image[:, :-tensor.shape[1]] = spectrogram_image[:, tensor.shape[1]:]
spectrogram_image[:, -tensor.shape[1]:] = tensor
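For a sense of scale (assuming torchaudio's default centered framing), each 1/15 second chunk contributes only a handful of columns, so the image scrolls by a few pixels per video frame:
samples_per_chunk = 3200
hop_length = 512
# the Spectrogram transform pads and centers frames by default,
# giving floor(samples / hop) + 1 columns per chunk
columns_per_chunk = samples_per_chunk // hop_length + 1
print(columns_per_chunk)  # 7 -> the 468-column image scrolls ~7 px per frame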
And boom! We are now generating a live spectrogram image from the audio stream. We can write the spectrogram image to a file with OpenCV for debugging...
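The debug write itself is a one-liner (spectrogram_image being the uint8 buffer we've been filling):
import cv2
# dump the current state of the rolling spectrogram buffer to disk
cv2.imwrite("spectrogram_test.png", spectrogram_image)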
import cv2
import matplotlib.pyplot as plt
image = cv2.imread("spectrogram_test.png")
plt.figure(figsize=(10, 4))
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()
The artifacts at the top are compression artifacts from the AAC codec. Exploring different codecs, like Opus and MP3, I've seen different artifacts appear. Unfortunately, this is the reality of working with lossy audio codecs, but for now it's good enough for prototyping. What we'd really like is a stream with its audio encoded in FLAC, which is a lossless codec.
Now we need to write this data to another pipeline and encode it as a video stream! But we also want to write the audio buffers as well so that we have both audio and video in the resulting output.
# flacenc does not accept F32LE, so audioconvert turns it into S16LE and we set the caps on the audio branch accordingly
sink_pipeline = f'''
appsrc name=video emit-signals=true format=time is-live=true \
appsrc name=audio emit-signals=true format=time is-live=true \
matroskamux name=mux ! filesink location="output.mkv" \
video. \
! queue \
! videoconvert \
! video/x-raw,format=I420 \
! x264enc tune=zerolatency speed-preset=ultrafast \
! h264parse config-interval=-1 ! mux. \
audio. \
! queue \
! audioconvert \
! audio/x-raw,format=S16LE \
! flacenc \
! mux.
'''.strip()
sink_pipeline = Gst.parse_launch(sink_pipeline)
video_sink = sink_pipeline.get_by_name("video")
audio_sink = sink_pipeline.get_by_name("audio")
# it's very important that we set the caps for the video and audio sink correctly
# In this example, we know exactly what the video format will be so we can set it ahead of time
# But in a more dynamic application, we will need to utilize a probe to set the caps on the fly
video_caps = Gst.Caps.from_string("video/x-raw,format=GRAY8,width=640,height=480,framerate=15/1")
audio_caps = Gst.Caps.from_string("audio/x-raw,format=F32LE,layout=interleaved,rate=48000,channels=2")
video_sink.set_property('caps', video_caps)
audio_sink.set_property('caps', audio_caps)
####
# in the on_audio_sample function, after generating the spectrogram image
###
scale_w = 640 / spectrogram_image_width
scale_h = 480 / spectrogram_image_height
scale = min(scale_w, scale_h)
video_width = int(spectrogram_image_width * scale)
video_height = int(spectrogram_image_height * scale)
resized_spec = cv2.resize(spectrogram_image, (video_width, video_height))
# place resized_spec in the center of a blank 640x480 frame
video_frame = np.zeros((480, 640), dtype=np.uint8)
y_offset = (480 - video_height) // 2
x_offset = (640 - video_width) // 2
video_frame[y_offset:y_offset+video_height, x_offset:x_offset+video_width] = resized_spec
# we utilize a global pts variable to keep track of the presentation timestamp
# for writing to the sinks
frame_buffer = Gst.Buffer.new_wrapped(video_frame.tobytes())
frame_buffer.pts = global_pts
# audio_buffer is the incoming buffer pulled from the appsink, so its
# duration (1/15 of a second) is the same as the video frame's
frame_buffer.duration = audio_buffer.duration
# copy the mapped audio data so it stays valid after the source buffer is unmapped
out_audio_buffer = Gst.Buffer.new_wrapped(bytes(map_info.data))
out_audio_buffer.pts = global_pts
out_audio_buffer.duration = audio_buffer.duration
# update the global pts
global_pts += frame_buffer.duration
video_sink.emit("push-buffer", frame_buffer)
audio_sink.emit("push-buffer", out_audio_buffer)
Now this isn't the full code, but hopefully you get the general idea of how to achieve this. But boy this code is so much fun to write. After running this and watching the resulting video file, we can actually visualize the spectrogram in real-time, alongside the audio stream!
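One detail not shown above that matters for getting a playable file: matroskamux only finalizes the MKV on end-of-stream, so on shutdown it's worth signaling EOS on both appsrc elements and letting it propagate through the sink pipeline before setting it to NULL. Roughly:
# signal end-of-stream on both appsrc elements so matroskamux can
# write its final headers and index before we tear the pipeline down
video_sink.emit("end-of-stream")
audio_sink.emit("end-of-stream")
# wait (up to 5 seconds) for the EOS to reach the end of the sink pipeline
sink_bus = sink_pipeline.get_bus()
sink_bus.timed_pop_filtered(5 * Gst.SECOND, Gst.MessageType.EOS | Gst.MessageType.ERROR)
sink_pipeline.set_state(Gst.State.NULL)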
I converted the resulting MKV file into an MP4 file so I can embed it here:
from IPython.display import Video
Video('https://jackmead515.github.io/videos/spectrovid.mp4', html_attributes='loop autoplay muted playsinline width="100%"')
I'm so giddy when things like this actually work. So cool. The video embedded here is actually running at 60 FPS and looks so buttery smooth. Now, this is just writing to a regular ole MKV file. But with some very easy adjustments, we could stream this out to an RTSP server via the rtspclientsink GStreamer element.
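To give a flavor of what those adjustments might look like: the muxer and filesink get swapped for an rtspclientsink element, which takes the encoded streams and publishes them to an existing RTSP server. This is an untested sketch, the URL is a placeholder, and I've swapped flacenc for opusenc since Opus is much friendlier to RTP payloading:
# untested sketch: publish to an RTSP server instead of muxing to a file.
# rtsp://localhost:8554/spectrogram is a placeholder for wherever an
# RTSP server (e.g. MediaMTX) is listening.
rtsp_sink_pipeline = '''
appsrc name=video emit-signals=true format=time is-live=true \
appsrc name=audio emit-signals=true format=time is-live=true \
rtspclientsink name=rtsp location="rtsp://localhost:8554/spectrogram" \
video. \
! queue \
! videoconvert \
! video/x-raw,format=I420 \
! x264enc tune=zerolatency speed-preset=ultrafast \
! h264parse config-interval=-1 ! rtsp. \
audio. \
! queue \
! audioconvert \
! audio/x-raw,format=S16LE \
! opusenc \
! rtsp.
'''.strip()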
Perhaps in a not so distant future, I will do just that...
For now, I'm so happy to have a working live spectrogram generator from a live audio stream. For the full code, check out the link here: live spectrogram generator.
Thanks for reading along! I know I haven't posted in a while, but I'm excited to write all the new things I've been learning this year.
And thank you to the OrcaSound project for providing open access to these hydrophone audio streams. Such a cool project!