In the previous installment, we explored how PieCloudVector facilitates the construction of a product recommendation system based on image data, detailing the complete process from dataset preparation to data vectorization, vector data storage, and similarity searches. This article will further explore the application of PieCloudVector to audio data to achieve audio recognition and classification.
PieCloudVector, one of the core computing engines of OpenPie's large model data computing system PieDataCS, represents the dimensional upgrade of analytical databases in the era of LLMs and is designed specifically for multimodal AI applications. In addition to PieCloudVector, PieDataCS also supports two other computing engines: PieCloudDB Database and PieCloudML.
Audio data, being a rich and dynamic type of unstructured data, is capably managed by PieCloudVector, an analytical database designed for the age of large models. Its application in audio data processing not only enhances the accuracy of audio recognition but also boosts the efficiency of audio classification. As the second installment of the "PieCloudVector Advanced Series," this article will illustrate, through the use of audio data, how a vector database can be instrumental in facilitating the vectorization, storage, and similarity search processes of audio data. (All demonstration data in this article comes from Hugging Face.)
When converting audio data into vectors, the sampling rate—the number of samples captured each second—is crucial. It dictates the quality of the data as well as the size of the resulting vectors. Higher sampling rates yield higher quality data but also lead to larger data sets. The example data for this article comes from the MInDS-14 dataset[1] on Hugging Face, which includes customer audio data from the electronic banking domain in 14 different languages, covering a total of 14 intent classes.
We'll break down the process into three main stages: "Dataset Preparation," "Data Dimensionality Reduction and Storage," and "Audio Similarity Search." The complete workflow is visualized in the diagram below:
Dataset Preparation
To begin, we will read in and adjust the audio data, converting it into vectors that will be stored in the database for future calculations. For the Hugging Face dataset and the steps that follow, you'll need the following two Python packages:
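The two packages are not spelled out in this excerpt; judging from the code in this walkthrough, they are presumably Hugging Face's datasets library and librosa. If so, they can be installed with pip:
pip install datasets librosa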
Fetch the MInDS-14 dataset from Hugging Face. Given the dataset's modest size, we proceed to load all 563 training entries at once.
from datasets import load_dataset
dataset_minds = load_dataset("PolyAI/minds14", "en-US", split="train")
To prevent the need for repeated downloads, it's advisable to save the dataset locally.
dataset_minds.save_to_disk('minds14_dataset')
from datasets import load_from_disk
dataset_minds = load_from_disk('minds14_dataset')
The dataset includes the following features: path, audio, transcription, English transcription, intent class, and language. In this context, our primary focus will be on the "audio", "intent_class", and "lang_id" features.
dataset_minds.features
{ 'path': Value(dtype='string', id=None),
'audio': Audio(sampling_rate=8000, mono=True, decode=True, id=None),
'transcription': Value(dtype='string', id=None),
'english_transcription': Value(dtype='string', id=None),
'intent_class': ClassLabel(names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill'], id=None),
'lang_id': ClassLabel(names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'], id=None)}
Data Dimensionality Reduction and Storage
Since the audio format within the Hugging Face dataset is quite intricate, we'll need to address the audio component of the data upfront. Let's start by considering the first entry in the dataset as a case in point:
In: dataset_minds['audio'][0]
Out: {'path': '/Users/arlena.wang/.cache/huggingface/datasets/downloads/extracted/fa6d050e601cf0ccf2c2b01238375a56579232af95e398fcef126ea4224e4185/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414, 0. , 0. ]),
'sampling_rate': 8000}
An audio data entry includes the audio file path, the audio waveform array, and the sampling rate corresponding to that waveform. The IPython package in Jupyter Notebook provides an Audio feature that creates a player directly in the notebook for playing back audio files or arrays, which helps in understanding the audio content. For the first entry in our dataset, we use its audio array and sampling rate to play the audio:
from IPython.display import Audio
Audio(data=dataset_minds['audio'][0]['array'], rate=dataset_minds['audio'][0]['sampling_rate'])
The dataset features a sampling rate of 8000 Hz. While a higher sampling rate delivers more detailed audio data, long recordings produce very large arrays, thereby amplifying the computational load. For instance, the audio vector for the first data entry, sampled at 8000 Hz, contains more than 80,000 elements.
print(dataset_minds['audio'][0]['sampling_rate'])
# 8000
print(len(dataset_minds['audio'][0]['array']))
# 86699
Dialing down the sampling rate is a strategic move to cut down on data dimensionality. The Python audio processing library, Librosa, is adept at handling both the loading and processing of audio files and is also a valuable tool for audio analysis. Using the first and second data entries as examples, we'll employ Librosa to tweak the sampling rate of our audio data, thereby refining the data size for more efficient vector storage. It's important to note that, even with the same sampling rate, the matrix length can differ across various audio files.
In: from librosa import resample
print(len(resample(dataset_minds['audio'][0]['array'], orig_sr=8000, target_sr=2000)))
print(len(resample(dataset_minds['audio'][1]['array'], orig_sr=8000, target_sr=2000)))
Out: 32513
19968
The vector length for each data entry varies after the sampling rate adjustment. To directly save these audio vectors to PieCloudVector, it's essential that the dimensionality of the audio vectors is standardized.
The subsequent function utilizes Librosa's fix_length tool to standardize the length of audio vectors, benchmarking against the longest vector and padding the shorter ones with zeros to match its length. However, zero-padding isn't the only option; alternative filling methods can be selected based on the specific traits of the data at hand.
from librosa import resample
from librosa.util import fix_length

# Find the length of the longest audio array in the dataset
max_len = 0
for item in dataset_minds['audio']:
    if len(item['array']) > max_len:
        max_len = len(item['array'])

# Downsample each entry to 2000 Hz and zero-pad it to the common length
def resample_audio_fix(audio):
    audio["audio_fix"] = fix_length(resample(audio['audio']['array'], orig_sr=8000, target_sr=2000, fix=True, scale=True), size=max_len)
    return audio

updated_fix_dataset = dataset_minds.map(resample_audio_fix)
The current audio vectors contain a significant number of zero values, which we can address by employing dimensionality reduction techniques, such as the PCA algorithm. This not only cuts down on computational expenses but also sharpens the performance of the K-Nearest Neighbors (KNN) algorithm by refining the data set. The workflow after incorporating dimensionality reduction is illustrated in the figure below:
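Before moving on, here is a minimal sketch of such a PCA reduction, assuming scikit-learn is available; the 256-component target is an illustrative choice rather than a value taken from this walkthrough, and updated_fix_dataset is the padded dataset produced above:
import numpy as np
from sklearn.decomposition import PCA

# Stack the fixed-length, zero-padded audio vectors into a (n_samples, n_features) matrix
audio_matrix = np.array(updated_fix_dataset['audio_fix'])

# Keep 256 principal components (a hypothetical choice; tune for your data)
pca = PCA(n_components=256)
audio_reduced = pca.fit_transform(audio_matrix)
print(audio_reduced.shape)  # e.g. (563, 256)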
Even with the sampling rate reduced to 2000, the audio vectors remain lengthy, reaching into the tens of thousands. This poses a significant challenge for both vector storage within the database and subsequent vector searches. Hence, we need to adopt an alternative audio processing technique—the Mel Spectrogram. This method is incredibly convenient and widely utilized as it transforms audio data into an "image-like" format, substantially reducing the data's dimensionality.
The Mel Spectrogram is an algorithm that employs Fast Fourier Transform to transition data from the time domain to the frequency domain. For an in-depth understanding of this algorithm, you might find the following article helpful:
This method offers two processing options: generating the Mel Spectrogram with a Diffusion model, or computing it directly with Librosa's melspectrogram function.
Below is an illustration of employing a Diffusion model to transform data into Mel Spectrograms.
import torch
from IPython.display import Audio
from diffusers import DiffusionPipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device)
output = pipe(
    raw_audio=dataset_minds['audio'][0]['array'],
    start_step=int(pipe.get_default_steps() / 2),
    mask_start_secs=1,
    mask_end_secs=1,
)
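To inspect the pipeline's result, the generated spectrogram image and the reconstructed waveform can be retrieved from the output; a minimal sketch, assuming the output exposes images and audios as in the audio-diffusion examples:
display(output.images[0])                                          # generated Mel Spectrogram as a PIL image
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))  # play back the reconstructed audio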
The resulting Mel Spectrogram post-conversion is depicted in the figure that follows:
Given the size and computational demands of the Diffusion model, we opted for the Mel Spectrogram conversion function within the Librosa library for audio processing. Let's take the first audio data as an example; post-conversion, we obtain a matrix of dimensions 128 x 170.
In: melspectrogram(y=dataset_minds['audio'][0]['array'], sr=8000).shape
Out: (128, 170)
We then use Hugging Face's map method to transform each entry in the dataset.
from librosa.feature import melspectrogram
import PIL
def transform_audio(audio):
    # Convert the waveform into a Mel Spectrogram and store it as a PIL image
    audio["audio_image"] = PIL.Image.fromarray(melspectrogram(y=audio['audio']['array'], sr=8000).astype('uint8'))
    return audio
updated_dataset = dataset_minds.map(transform_audio)
Upon completion of the conversion, we use the PIL tool to visualize the resulting spectrogram image.
display(updated_dataset["audio_image"][0])
This approach allows us to model and process audio data in a manner akin to handling image data. Specifically, we'll utilize the Embedding model previously applied to image data to further refine the audio data.
from imgbeddings import imgbeddings
ibed = imgbeddings()
embedding_0 = ibed.to_embeddings(updated_dataset["audio_image"][0])
The converted audio data is now a vector of dimension 768.
In: embedding_0.shape
Out: (1, 768)
Next, we convert all the Mel Spectrogram images into embeddings in a single batch.
embedding = ibed.to_embeddings(updated_dataset['audio_image'])
Following the conversion, we proceed to write the 768-dimensional vectors, intent categories, and language information into PieCloudVector. Initially, we establish the necessary tables within PieCloudVector.
CREATE TABLE IF NOT EXISTS vec_test.banking_audio (id bigserial PRIMARY KEY, embedding vector(768), intent_class varchar(50), lang_id varchar(10));
TRUNCATE TABLE vec_test.banking_audio;
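Optionally, for larger datasets an approximate nearest-neighbour index can speed up the similarity search performed later; assuming PieCloudVector accepts pgvector-style index definitions, an IVFFlat index on the embedding column with L2 distance would look like this:
CREATE INDEX ON vec_test.banking_audio USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);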
On the Python side, we connect to the vector database PieCloudVector and insert the data with the following code:
import psycopg2
embeddings_lst = embedding.tolist()
conn = psycopg2.connect('postgresql://user:passwd@192.163.**.**:5432/pgvec_test')
cur = conn.cursor()
for i in range(len(embeddings_lst)):
    cur.execute('INSERT INTO vec_test.banking_audio (embedding, intent_class, lang_id) values (%s,%s,%s)', (embeddings_lst[i], updated_dataset["intent_class"][i], updated_dataset["lang_id"][i]))
conn.commit()
conn.close()
Audio Similarity Search
We employ the K-Nearest Neighbors (KNN) approach with the L2 distance metric (the <-> operator in the query below) to identify the 10 recordings closest to the first data sample for comparative analysis.
from sqlalchemy import create_engine, text as sql_text
import pandas as pd
engine = create_engine('postgresql://user:passwd@192.163.**.**:5432/pgvec_test', echo=False).connect()
audio_id = pd.read_sql_query(
    sql=sql_text('select id, intent_class, lang_id from vec_test.banking_audio where id != 1 order by embedding <-> ' + "'" + str(embedding_0.tolist()[0]) + "'" + ' limit 10'),
    con=engine)
The results are as follows:
The first data entry is a recording of a woman asking about joint account matters. The intent and language of its closest match returned by the search are as follows:
In: updated_dataset.features["intent_class"].int2str(int(audio_id.loc[0, 'intent_class']))
Out: 'direct_debit'
In: updated_dataset.features["lang_id"].int2str(int(audio_id.loc[0, 'lang_id']))
Out: 'en-US'
Among the top ten records retrieved, all audio samples are in American English (language ID 4), albeit with varying intents. A straightforward calculation reveals the most frequent intent class among them.
audioid_lst = audio_id['id'].to_list()

def most_common(lst):
    return max(set(lst), key=lst.count)

# Database ids start at 1 while dataset indices start at 0, hence the i - 1
label = most_common([updated_dataset['intent_class'][i - 1] for i in audioid_lst])
print(updated_dataset.features["intent_class"].int2str(int(label)))
The result is 'atm_limit'.
In: updated_dataset.features["intent_class"].int2str(int(label))
Out: 'atm_limit'
By utilizing the Audio function once more to play the audio most similar to our initial sample, we find that the 355th piece of data is an 11-second clip, featuring a female voice inquiring about 'direct_debit' matters.
Audio(data=updated_dataset['audio'][354]['array'], rate=dataset_minds['audio'][354]['sampling_rate'])
While audio similarity search has limitations in recognizing textual content, it excels at capturing similarities at the waveform level, making it particularly adept at comparing the timbre and linguistic nuances of audio. If the input audio is a piece of music, the system can effectively suggest similar tracks to the user by identifying key components such as melody and rhythm, and matching them with entries in the database to recommend music with a comparable style. These recommendations aim to provide an auditory experience akin to the user's preferred music, thereby enhancing user satisfaction and overall experience.
In our forthcoming article, we will delve into how to harness PieCloudVector's text data processing capabilities to construct a ChatBot.