Machine Learning for Voice Recognition: How To Create a Speaker Identification Model in Python

How to build a robust speaker recognition system with Python and PyTorch. This guide covers data preprocessing, model training, and feature extraction. Ideal for developers implementing voice recognition and speaker identification in machine learning projects.


A few months ago, I started messing around with machine learning by building a model that could spot cracks in roads. It was super interesting, with a lot of learning involved as I fine-tuned the model until it could actually recognize cracks.

Today, we’re going to take this idea and apply it to audio. I trained a model using Torchaudio and Python to identify speakers, and I’ll walk you through how to set it up yourself.

Data Collection

To build a good model, the quality of the data is key. We need clean data so the model can learn effectively. For this project, I used speeches from U.S. presidents. I wanted to create a model that could recognize who’s speaking.

So, I headed over to the Miller Center website and downloaded a handful of speeches from Joe Biden, Donald Trump, and Barack Obama (the exact files are listed below). Then, I set up three folders, one for each president, and saved the speeches as MP3 files in the corresponding folder.

Presidential Speeches | Miller Center
data/barack_obama:
total 5632496
-rw-r--r--@ 1 nacho  staff    52M Oct 28 13:36 bho_2015_0626_ClementaPickney.mp3
-rw-r--r--@ 1 nacho  staff    83M Oct 28 13:36 bho_2016_0112_StateoftheUnion.mp3
-rw-r--r--@ 1 nacho  staff    47M Oct 28 13:37 bho_2016_0322_PeopleCuba.mp3
-rw-r--r--@ 1 nacho  staff    58M Oct 28 13:36 bho_2016_0515_RutgersCommencement.mp3
-rw-r--r--@ 1 nacho  staff    71M Oct 28 13:35 obama_farewell_address_to_the_american_people.mp3

data/donald_trump:
total 1023680
-rw-r--r--@ 1 nacho  staff   2.5M Oct 28 13:30 Message_Donald_Trump_post_riot.mp3
-rw-r--r--@ 1 nacho  staff    15M Oct 28 13:31 President_Trump_Remarks_on_2020_Election.mp3
-rw-r--r--@ 1 nacho  staff   932K Oct 28 13:30 Trump_message_to_supporters_during_capitol_riot.mp3
-rw-r--r--@ 1 nacho  staff   4.8M Oct 28 13:29 message_from_trump.mp3
-rw-r--r--@ 1 nacho  staff    18M Oct 28 13:29 trump_farewell_address.mp3

data/joe_biden:
total 1336536
-rw-r--r--@ 1 nacho  staff   9.0M Oct 28 13:33 biden_addresses_nation_2024_07_14.mp3
-rw-r--r--@ 1 nacho  staff    15M Oct 28 13:33 biden_addresses_nation_2024_07_25.mp3
-rw-r--r--@ 1 nacho  staff    34M Oct 28 13:32 biden_remarks_UN_2024_09_24.mp3
-rw-r--r--@ 1 nacho  staff    20M Oct 28 13:33 biden_remarks_on_middle_east.mp3

Once I had this in place, it was time to start coding the trainer.

Building the Trainer

Now that we have our data organized, it’s time to get coding and set up the trainer. We’ll be using PyTorch and Torchaudio to process the audio data and train our model to recognize different speakers.

Step 0: Converting MP3 to WAV

First up, we need our audio data in a consistent format. MP3 is great for distribution, but converting everything to uncompressed WAV keeps the pipeline consistent and avoids relying on torchaudio’s MP3 support. To be clear, decompressing won’t bring back any quality lost when the MP3 was encoded; the goal here is a uniform format, not better audio. We run through each speaker’s folder, find all the MP3 files, and convert them to WAV with pydub, skipping any that have already been converted.
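If you’re following along, you’ll also need a few packages installed; something like this should cover it in a pip-based environment (adjust for your own setup):

pip install torch torchaudio pydub scikit-learn

pydub additionally expects ffmpeg to be available on your system for MP3 decoding, so install that too (for example, brew install ffmpeg on macOS or apt install ffmpeg on Debian/Ubuntu).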

import os
import torch
import torchaudio
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import LabelEncoder
from pydub import AudioSegment

# Step 0: Convert MP3 files to WAV format
def convert_mp3_to_wav(root_dir):
    for speaker in os.listdir(root_dir):
        speaker_dir = os.path.join(root_dir, speaker)
        if not os.path.isdir(speaker_dir):
            continue  # skip stray files sitting directly in the data folder
        for file_name in os.listdir(speaker_dir):
            if file_name.endswith(".mp3"):
                mp3_path = os.path.join(speaker_dir, file_name)
                wav_path = os.path.join(speaker_dir, f"{os.path.splitext(file_name)[0]}.wav")
                if os.path.exists(wav_path):
                    continue  # already converted on a previous run
                audio = AudioSegment.from_mp3(mp3_path)
                audio.export(wav_path, format="wav")
                print(f"Converted {mp3_path} to {wav_path}")

Step 1: Preparing the Data with SpeakerDataset

To feed our model, we define a custom dataset class called SpeakerDataset. Here’s what’s happening in this part:

  • Loading Data: We loop through each speaker’s folder, collecting each audio file and labeling it based on the speaker’s name. This label will later help our model know which speaker is which.
  • Label Encoding: Since the model works with numbers (not names), we convert each speaker’s name into a unique integer label using LabelEncoder from scikit-learn. This way, the model can focus on learning the patterns in each speaker’s voice, rather than interpreting names.
  • Extracting MFCC Features: MFCC (Mel Frequency Cepstral Coefficients) is a popular technique in audio processing, especially for speech and speaker recognition. It transforms our audio waveform into a compact set of features that makes it easier for the model to learn speaker-specific patterns. Parameters like n_mfcc and n_mels control how much detail the model learns from; we keep it simple with n_mfcc=13 and n_mels=40, the same values used in the prediction script later on. There’s a quick shape check right after this list if you want to see what the transform produces.
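Here’s that quick, standalone shape check (a hypothetical snippet, not part of trainer.py); it assumes you’ve already converted at least one file to WAV:

import torchaudio

# Any converted WAV file will do; this one comes from the Obama folder above
waveform, sample_rate = torchaudio.load("data/barack_obama/bho_2015_0626_ClementaPickney.wav")
mfcc = torchaudio.transforms.MFCC(
    sample_rate=sample_rate,
    n_mfcc=13,
    melkwargs={'n_mels': 40}
)(waveform)
print(mfcc.shape)              # (channels, n_mfcc, time_frames)
print(mfcc.mean(dim=2).shape)  # (channels, n_mfcc) after averaging over time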
# Step 1: Define the dataset class for speaker classification
class SpeakerDataset(Dataset):
    def __init__(self, root_dir, n_mfcc=13, n_mels=40):
        self.root_dir = root_dir
        # Only treat sub-directories as speakers (ignores stray files such as a leftover new.mp3)
        self.speakers = sorted(
            d for d in os.listdir(root_dir)
            if os.path.isdir(os.path.join(root_dir, d))
        )
        self.file_paths = []
        self.labels = []
        self.n_mfcc = n_mfcc
        self.n_mels = n_mels
        
        # Label encoding
        self.label_encoder = LabelEncoder()
        self.label_encoder.fit(self.speakers)
        
        # Load file paths and labels
        for speaker in self.speakers:
            speaker_dir = os.path.join(root_dir, speaker)
            for file_name in os.listdir(speaker_dir):
                if file_name.endswith(".wav"):
                    self.file_paths.append(os.path.join(speaker_dir, file_name))
                    self.labels.append(speaker)

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        file_path = self.file_paths[idx]
        label = self.labels[idx]
        waveform, sample_rate = torchaudio.load(file_path)
        mfcc = torchaudio.transforms.MFCC(
            sample_rate=sample_rate, 
            n_mfcc=self.n_mfcc, 
            melkwargs={'n_mels': self.n_mels}
        )(waveform)
        mfcc = mfcc.mean(dim=2).squeeze()  # Reduce to 1D by averaging over time

        # Encode label as integer
        label = self.label_encoder.transform([label])[0]
        return mfcc, label

Step 2: Building the Model with SpeakerClassifier

Now we define our neural network model using a simple architecture. This SpeakerClassifier has two fully connected (FC) layers:

  • First Layer (fc1): Takes our MFCC features as input and learns basic voice patterns.
  • Activation Layer: Adds some non-linearity with ReLU, helping the model capture complex voice patterns.
  • Second Layer (fc2): Outputs predictions for each speaker, giving us the final classification.

This setup is simple but effective, and it’s small enough to run on most machines without needing specialized hardware.

# Step 2: Define the classification model
class SpeakerClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super(SpeakerClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, 80)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(80, num_classes)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # Flatten the MFCC features
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
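Before wiring it into the training loop, you can sanity-check the model with a dummy forward pass (hypothetical numbers, not part of trainer.py); with 13 averaged MFCC coefficients and 3 speakers, you should get one logit per speaker:

import torch

model = SpeakerClassifier(input_size=13, num_classes=3)
dummy_batch = torch.randn(2, 13)   # pretend batch of two averaged MFCC vectors
print(model(dummy_batch).shape)    # torch.Size([2, 3]) -- one score per speaker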

Step 3: Training the Model

The training loop is where we teach the model to recognize each speaker. Here’s the basic flow:

  1. Forward Pass: The model makes predictions based on the MFCC features.
  2. Calculate Loss: We use CrossEntropyLoss, which is a great choice for classification tasks like ours.
  3. Backpropagation: The model learns from its mistakes and updates its parameters to improve.
  4. Repeat: We run this over 20 epochs to make sure the model gets enough practice with the data.
# Step 3: Train the model
def train_model(model, dataloader, criterion, optimizer, num_epochs=20):
    for epoch in range(num_epochs):
        running_loss = 0.0
        for mfcc, labels in dataloader:
            optimizer.zero_grad()
            outputs = model(mfcc)               # forward pass
            loss = criterion(outputs, labels)   # calculate loss
            loss.backward()                     # backpropagation
            optimizer.step()                    # update parameters
            running_loss += loss.item()
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(dataloader):.4f}")

Step 4: Putting It All Together

Finally, in the main function, we:

  • Convert our MP3s to WAVs (if they haven’t already been converted).
  • Initialize our dataset and data loader.
  • Set up and train our model with the features from each speaker’s audio.
  • Save the Model: Once training is done, we save the model weights and the label encoder so we can load them later for predictions.

And that’s it! By the end, we have a model ready to identify speakers based on audio input.

# Step 4: Main function to train the speaker classification model
def main():
    # Convert MP3 files to WAV format
    root_dir = 'data/'  # Update with your data directory
    convert_mp3_to_wav(root_dir)

    # Build the dataset and data loader
    dataset = SpeakerDataset(root_dir, n_mfcc=13, n_mels=40)  # n_mels=40, matching the prediction script
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

    # Model setup
    sample_mfcc, _ = dataset[0]
    input_size = sample_mfcc.numel()
    num_classes = len(dataset.speakers)
    model = SpeakerClassifier(input_size, num_classes)

    # Training setup
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    train_model(model, dataloader, criterion, optimizer, num_epochs=20)

    # Save the trained model and the label encoder
    torch.save(model.state_dict(), 'speaker_classifier.pth')
    torch.save(dataset.label_encoder, 'label_encoder.pth')
    print("Model and label encoder saved.")

if __name__ == "__main__":
    main()

This setup keeps things simple but effective.

Let's save this as trainer.py and run it:

python3 trainer.py

The output should look something like this:

Converted data/barack_obama/bho_2016_0322_PeopleCuba.mp3 to data/barack_obama/bho_2016_0322_PeopleCuba.wav
Converted data/barack_obama/obama_farewell_address_to_the_american_people.mp3 to data/barack_obama/obama_farewell_address_to_the_american_people.wav
Converted data/barack_obama/bho_2016_0112_StateoftheUnion.mp3 to data/barack_obama/bho_2016_0112_StateoftheUnion.wav
Converted data/barack_obama/bho_2016_0515_RutgersCommencement.mp3 to data/barack_obama/bho_2016_0515_RutgersCommencement.wav
Converted data/barack_obama/bho_2015_0626_ClementaPickney.mp3 to data/barack_obama/bho_2015_0626_ClementaPickney.wav
Converted data/joe_biden/biden_addresses_nation_2024_07_14.mp3 to data/joe_biden/biden_addresses_nation_2024_07_14.wav
Converted data/joe_biden/biden_remarks_on_middle_east.mp3 to data/joe_biden/biden_remarks_on_middle_east.wav
Converted data/joe_biden/biden_addresses_nation_2024_07_25.mp3 to data/joe_biden/biden_addresses_nation_2024_07_25.wav
Converted data/joe_biden/biden_remarks_UN_2024_09_24.mp3 to data/joe_biden/biden_remarks_UN_2024_09_24.wav
Converted data/donald_trump/trump_farewell_address.mp3 to data/donald_trump/trump_farewell_address.wav
Converted data/donald_trump/President_Trump_Remarks_on_2020_Election.mp3 to data/donald_trump/President_Trump_Remarks_on_2020_Election.wav
Converted data/donald_trump/Trump_message_to_supporters_during_capitol_riot.mp3 to data/donald_trump/Trump_message_to_supporters_during_capitol_riot.wav
Converted data/donald_trump/message_from_trump.mp3 to data/donald_trump/message_from_trump.wav
Converted data/donald_trump/Message_Donald_Trump_post_riot.mp3 to data/donald_trump/Message_Donald_Trump_post_riot.wav
Epoch [1/20], Loss: 5.0933
Epoch [2/20], Loss: 2.2040
Epoch [3/20], Loss: 2.5992
Epoch [4/20], Loss: 0.9386
Epoch [5/20], Loss: 0.8153
Epoch [6/20], Loss: 0.4925
Epoch [7/20], Loss: 0.3129
Epoch [8/20], Loss: 0.2327
Epoch [9/20], Loss: 0.2054
Epoch [10/20], Loss: 0.1137
Epoch [11/20], Loss: 0.1498
Epoch [12/20], Loss: 0.0704
Epoch [13/20], Loss: 0.1161
Epoch [14/20], Loss: 0.1120
Epoch [15/20], Loss: 0.1477
Epoch [16/20], Loss: 0.0666
Epoch [17/20], Loss: 0.1880
Epoch [18/20], Loss: 0.0724
Epoch [19/20], Loss: 0.1370
Epoch [20/20], Loss: 0.0584
Model and label encoder saved.

Testing the Model

We’ve trained our model and can see how it improved over each epoch. You should now see two files in your directory: speaker_classifier.pth and label_encoder.pth. These contain the model’s learned parameters and the encoded labels for each speaker. We’ll load them in a moment to put our model to the test.

Show Me the Results

To test the model, we’ll use a fresh speech sample. Head over to the Miller Center website, download any speech by Obama, Trump, or Biden that wasn’t used in the training set, and save it in the data folder with the filename new.mp3.

Let’s dive into the prediction script.

import torch
import torch.nn as nn
import torchaudio
from sklearn.preprocessing import LabelEncoder
from pydub import AudioSegment

# Step 0: Convert new.mp3 to new.wav
def convert_mp3_to_wav(file_path):
    audio = AudioSegment.from_mp3(file_path)
    audio.export("data/new.wav", format="wav")

First, we convert our new MP3 sample into WAV format. This step makes sure the format matches what we used for training, ensuring smoother processing.

Loading the Model

Now we define the same model structure (SpeakerClassifier) and load our trained model and label encoder.

# Step 1: Define the model class (same as used during training)
class SpeakerClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super(SpeakerClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, 80)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(80, num_classes)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # Flatten the MFCC features
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Step 2: Load the model and label encoder
def load_model(input_size, num_classes):
    model = SpeakerClassifier(input_size, num_classes)
    model.load_state_dict(torch.load('speaker_classifier.pth', weights_only=False))
    model.eval()
    return model

label_encoder = torch.load('label_encoder.pth', weights_only=False)

Here, we’re setting up the model just like in training. Then, we load the saved weights (speaker_classifier.pth) and the label encoder (label_encoder.pth) to map predictions back to speaker names.
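One note on the weights_only flag: the label encoder is a pickled scikit-learn object, so it needs weights_only=False to load (recent PyTorch versions default to weights_only=True, which only allows plain tensors and simple containers). For the state_dict itself, weights_only=True would also work and is the safer option, so in general only pass weights_only=False for files you created yourself.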

Preparing the New Audio Sample

Next, we extract MFCC features from the new audio file. This converts the audio into features that our model can work with.

# Step 3: Prepare the new audio sample
def extract_mfcc(file_path, n_mfcc=13, n_mels=40):  # must match the settings used during training
    waveform, sample_rate = torchaudio.load(file_path)
    mfcc = torchaudio.transforms.MFCC(
        sample_rate=sample_rate, 
        n_mfcc=n_mfcc, 
        melkwargs={'n_mels': n_mels}
    )(waveform)
    return mfcc.mean(dim=2).squeeze()

Predicting the Speaker

This is the moment of truth! In predict_speaker, we pass our MFCC features through the model and let it make a prediction. The model’s output is then matched to the corresponding speaker’s name.

# Step 4: Predict speaker
def predict_speaker(file_path, model, label_encoder):
    mfcc = extract_mfcc(file_path).unsqueeze(0)  # Add batch dimension
    with torch.no_grad():
        output = model(mfcc)
        _, predicted = torch.max(output, 1)
        speaker_name = label_encoder.inverse_transform([predicted.item()])[0]
        print(f"Predicted Speaker: {speaker_name}")
        return speaker_name
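The model returns raw logits, so if you also want a rough confidence score next to the name, one option (a small sketch, not part of the original predictor.py) is to run the output through softmax:

# Hypothetical variant of predict_speaker that also reports a confidence score
def predict_speaker_with_confidence(file_path, model, label_encoder):
    mfcc = extract_mfcc(file_path).unsqueeze(0)  # Add batch dimension
    with torch.no_grad():
        output = model(mfcc)
        probs = torch.softmax(output, dim=1)      # turn logits into probabilities
        confidence, predicted = torch.max(probs, 1)
        speaker_name = label_encoder.inverse_transform([predicted.item()])[0]
        print(f"Predicted Speaker: {speaker_name} ({confidence.item():.1%})")
        return speaker_name, confidence.item()

With this little training data and such a small model, the probabilities will be overconfident, so treat them as a rough signal rather than a calibrated score.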

Running the Prediction

In the last part, we:

  1. Convert our test MP3 file into WAV format.
  2. Calculate the input size from a sample MFCC.
  3. Load the trained model and label encoder.
  4. Run predict_speaker to see if the model correctly identifies the speaker.
# Example usage
if __name__ == "__main__":
    # Convert MP3 to WAV
    convert_mp3_to_wav("data/new.mp3")
    
    # Calculate input size from a sample MFCC
    sample_mfcc = extract_mfcc("data/barack_obama/bho_2015_0626_ClementaPickney.wav")
    input_size = sample_mfcc.numel()
    num_classes = len(label_encoder.classes_)

    model = load_model(input_size, num_classes)
    
    # Predict the speaker from a new audio file
    query_file = "data/new.wav"  # Replace with your query file path
    predict_speaker(query_file, model, label_encoder)

In my case, I saved everything together in a file called predictor.py and ran it:

python3 predictor.py

The result printed on the screen was:

Predicted Speaker: barack_obama

This means it worked! As described above, I picked a speech at random that wasn’t in the training set: Barack Obama’s November 15, 2021: Signing the Infrastructure Investment and Jobs Act from the Miller Center.

To Conclude

As you can see, training your own models isn’t too hard. The key is having clean, well-organized data to ensure the model can effectively make predictions—in this case, recognizing the speaker.

For other use cases, you may need more data, depending on the variety of what you’re trying to analyze.

This model is just the beginning. Later, we’ll integrate it with Milvus to store a larger volume of data for deeper analysis and predictions. One goal I have in mind is to build my own assistant, and this code is the first step toward that.

Let me know what you think and if any other use cases come to mind for this approach!