Train a Voice Authentication System in Python with Your Own Voice
Build your own voice authentication system in Python. Record your voice, train a model, and unlock access with just a phrase. No cloud APIs, just pure open-source ML magic, inspired by spy movies, powered by your mic.
Something I’ve always loved in movies is biometric authentication: fingerprint scans, retina readers, voice recognition. You know, the kind of high-tech stuff you see in spy films like 007. I’m a sucker for that genre.
Just like we did in a previous post where we trained a model to recognize who was speaking, today we’re going one step further: we’re going to train a model with your own voice and build a simple voice-based authentication system.
If you want to read that article first, you can find it here:
So, let’s get started; we’ve got a few fun things to cover. First, why even build something like this?
- Because it’s fun. Machine Learning is always fun to me.
- Because it’s useful. You can hook this up to a Raspberry Pi, a mic, and a relay to open your door, your toolbox, your fridge, or even your secret stash of snacks.
- Because it’s practical. Imagine using it to unlock your computer, access a private folder, trigger a smart home routine, or authenticate access to a server or admin panel.
- Because it teaches. It helps me explore what Machine Learning can actually do, and if you know me, you know I love automating things.
Getting Started
So, first things first… let’s break this down into four simple steps:
- Record your voice, saying a specific phrase (your “voice password”) multiple times.
- Extract audio features, we’ll use MFCCs to convert raw audio into something our model can understand.
- Train a machine learning model, so it can learn to recognize your voice and distinguish it from others.
- Run real-time voice authentication, speak into the mic, and if it matches your voice password, you’re in.
We won’t hook it up to a container or secure server just yet, but by the end, you’ll have a working voice-based authentication system ready to trigger any action you want next.
What do you think?
Let’s do it.
Step 1: Record Your Voice Password and Negative Samples
Alright, time to lay down the foundation: your voice password.
This is the phrase you’ll train the model to recognize. It could be anything:
“open sesame”, “let me in”, “nacho unlocks things”, or just “access”.
In this step, we’re going to:
- Record yourself saying that phrase multiple times (to train the model properly).
- Save each sample as a .wav file.
- Organize them into folders for easy training later.
But here’s the twist: to teach the model what “not you” sounds like, we’ll also record a second word (in my case, “banana”) as a negative class. That way, the model learns to distinguish your actual password from random or incorrect inputs.
Let’s get it done.
I’m using this simple Python script to record both sets of samples.
First, we’ll record the word “access” 10 times. Then, we’ll run it again and record “banana” into a different folder.
In the next step, we’ll feed all of that into our model for training. In the script below, comment and uncomment the PHRASE and OUTPUT_DIR lines depending on whether you’re recording your password or the negative samples.
# record_voice_password.py
import sounddevice as sd
from scipy.io.wavfile import write
import os
SAMPLE_RATE = 16000 # Hz
DURATION = 2 # seconds per sample
NUM_SAMPLES = 10
PHRASE = "access" # your password phrase
OUTPUT_DIR = f"voice_samples/you_{PHRASE.replace(' ', '_')}"
#PHRASE = "banana" # or whatever
#OUTPUT_DIR = f"voice_samples/other_access"
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"Let's record your voice saying: \"{PHRASE}\"")
print("You'll record it 10 times. Get ready...")
for i in range(NUM_SAMPLES):
    input(f"\nPress Enter to record sample {i+1}/{NUM_SAMPLES}...")
    print("Recording...")
    audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1, dtype='int16')
    sd.wait()
    filename = f"{OUTPUT_DIR}/sample_{i}.wav"
    write(filename, SAMPLE_RATE, audio)
    print(f"Saved: {filename}")
print("\nAll done! You've recorded your voice samples.")
Tips:
- Record in a quiet space.
- Try to speak naturally, but clearly.
- You can change PHRASE if you want a different password.
- Want to test with another person later? Just record their samples in a different folder like voice_samples/other_access.
Running the program looks like this:
➜ auth-voice-system python3 record_voice_password.py
Let's record your voice saying: "access"
You'll record it 10 times. Get ready...
Press Enter to record sample 1/10...
Recording...
Saved: voice_samples/you_access/sample_0.wav
Press Enter to record sample 2/10...
Recording...
Saved: voice_samples/you_access/sample_1.wav
Press Enter to record sample 3/10...
Recording...
Saved: voice_samples/you_access/sample_2.wav
Press Enter to record sample 4/10...
Recording...
Saved: voice_samples/you_access/sample_3.wav
Press Enter to record sample 5/10...
Recording...
Saved: voice_samples/you_access/sample_4.wav
Press Enter to record sample 6/10...
Recording...
Saved: voice_samples/you_access/sample_5.wav
Press Enter to record sample 7/10...
Recording...
Saved: voice_samples/you_access/sample_6.wav
Press Enter to record sample 8/10...
Recording...
Saved: voice_samples/you_access/sample_7.wav
Press Enter to record sample 9/10...
Recording...
Saved: voice_samples/you_access/sample_8.wav
Press Enter to record sample 10/10...
Recording...
Saved: voice_samples/you_access/sample_9.wav
All done! You've recorded your voice samples.
Once the recordings are done, we can verify them by browsing the voice_samples/you_access folder:
➜ auth-voice-system ll voice_samples/you_access
total 1280
-rw-r--r-- 1 nacho staff 63K Sep 23 11:21 sample_0.wav
-rw-r--r-- 1 nacho staff 63K Sep 23 11:21 sample_1.wav
-rw-r--r-- 1 nacho staff 63K Sep 23 11:22 sample_2.wav
-rw-r--r-- 1 nacho staff 63K Sep 23 11:22 sample_3.wav
-rw-r--r-- 1 nacho staff 63K Sep 23 11:22 sample_4.wav
-rw-r--r-- 1 nacho staff 63K Sep 23 11:22 sample_5.wav
-rw-r--r-- 1 nacho staff 63K Sep 23 11:22 sample_6.wav
-rw-r--r-- 1 nacho staff 63K Sep 23 11:22 sample_7.wav
-rw-r--r-- 1 nacho staff 63K Sep 23 11:22 sample_8.wav
-rw-r--r-- 1 nacho staff 63K Sep 23 11:22 sample_9.wav
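Before moving on, it’s worth sanity-checking a sample by playing it back. Here’s a minimal sketch using the same sounddevice library we recorded with (the file path is just one of the samples above):
# play_sample.py - optional sanity check, not part of the main pipeline
import sounddevice as sd
from scipy.io.wavfile import read

SAMPLE_FILE = "voice_samples/you_access/sample_0.wav"  # pick any sample

rate, audio = read(SAMPLE_FILE)  # returns (sample rate, numpy array)
sd.play(audio, rate)             # play through the default output device
sd.wait()                        # block until playback finishes
If a sample sounds clipped or noisy, just re-run the recorder and overwrite it.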
Step 2: Extract Audio Features (MFCCs)
Now that you’ve recorded your voice password, we need to convert those .wav files into numerical features.
Why? Because machine learning models don’t understand sound waves directly; they need feature vectors that represent meaningful audio characteristics.
The go-to choice for voice recognition is MFCCs: Mel-Frequency Cepstral Coefficients. Think of them as a compact representation of how your voice sounds.
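If you’re curious what that looks like in practice, here’s a quick sketch that loads one of the samples from Step 1 and prints the raw MFCC matrix shape (just for illustration; the real extraction script comes next):
# inspect_mfcc.py - a quick peek at raw MFCC output (illustrative only)
import librosa

y, sr = librosa.load("voice_samples/you_access/sample_0.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # roughly (13, 63): 13 coefficients per audio frame over 2 seconds
We’ll average those frames over time in the next script, so every clip becomes a fixed-length vector of 13 values.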
What We’ll Do
- Load your .wav files using librosa
- Extract MFCCs from each file
- Save those features along with their labels (your identity)
- Prepare them for model training
# extract_features.py
import librosa
import numpy as np
import os
import joblib
SAMPLE_RATE = 16000 # Must match your recording rate
N_MFCC = 13 # Number of MFCC features
def extract_features(file_path):
    """Extracts MFCCs from a single audio file."""
    y, sr = librosa.load(file_path, sr=SAMPLE_RATE)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
    return np.mean(mfcc.T, axis=0)  # Use mean over time axis for fixed-length vector

def build_dataset(base_dir="voice_samples"):
    """Scans folders and builds dataset of features + labels."""
    X, y = [], []
    for label in os.listdir(base_dir):
        label_path = os.path.join(base_dir, label)
        if not os.path.isdir(label_path):
            continue
        for file in os.listdir(label_path):
            if file.endswith(".wav"):
                file_path = os.path.join(label_path, file)
                features = extract_features(file_path)
                X.append(features)
                y.append(label)
    return np.array(X), np.array(y)

if __name__ == "__main__":
    X, y = build_dataset()
    print(f"Extracted features from {len(X)} audio samples.")
    print(f"Labels found: {set(y)}")
    joblib.dump((X, y), "voice_features.pkl")
    print("Saved features to voice_features.pkl")
Run it and you should see output like this (I ran it here before recording the “banana” negatives, so only the you_access label shows up; the script picks up the second folder automatically once it exists):
➜ auth-voice-system python3 extract_features.py
Extracted features from 10 audio samples.
Labels found: {'you_access'}
Saved features to voice_features.pkl
Step 3: Train the Voice Authentication Model
We’re going to use a simple and effective classifier: Support Vector Machine (SVM) via scikit-learn. It’s lightweight, fast to train, and works great for small feature vectors like MFCCs.
What This Step Does
- Load the MFCC features (X) and labels (y)
- Split into training and testing sets
- Train a classifier (we’ll use an SVM)
- Evaluate performance
- Save the model for use during authentication
# train_model.py
import joblib
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from extract_features import build_dataset
# Load extracted features (from voice_samples/...) or from .pkl file
# Option 1: Extract directly from files
X, y = build_dataset()
# Option 2: Load from pre-saved file
# X, y = joblib.load("voice_features.pkl")
print(f"Training on {len(X)} samples across labels: {set(y)}")
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Train a Support Vector Machine classifier
model = SVC(kernel='linear', probability=True)
model.fit(X_train, y_train)
# Evaluate model
y_pred = model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Save the trained model
joblib.dump(model, "voice_auth_model.pkl")
print("\nModel saved as voice_auth_model.pkl")
If everything goes well, we’ll see output like this:
➜ auth-voice-system python3 train_model.py
Training on 20 samples across labels: {'other_access', 'you_access'}
Classification Report:
              precision    recall  f1-score   support

other_access       1.00      1.00      1.00         2
  you_access       1.00      1.00      1.00         2

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4
Confusion Matrix:
[[2 0]
[0 2]]
Model saved as voice_auth_model.pkl
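One caveat before we celebrate: with only 4 test samples, a perfect report is easy to get and doesn’t tell us much. If you want a slightly more honest estimate, a quick k-fold cross-validation is one option; here’s a minimal sketch using scikit-learn’s cross_val_score (an optional extra, not part of the original script):
# cross_validate.py - optional, sturdier evaluation on tiny datasets
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from extract_features import build_dataset

X, y = build_dataset()
model = SVC(kernel='linear', probability=True)

# 5 folds: train on 4/5 of the samples, test on the remaining 1/5, five times
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy per fold: {scores}")
print(f"Mean accuracy: {np.mean(scores):.2f}")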
Ok, let’s see how this works.
Step 4: Real-Time Voice Authentication
Now that we’ve got a trained model (voice_auth_model.pkl), we’re going to:
- Record a new voice sample, live
- Extract its features (MFCCs, just like before)
- Load the model
- Predict whether it’s you (or not)
- If the model recognizes your voice, grant access (or trigger an action!)
Here’s what the last piece of code looks like. Get ready with your mic...
# authenticate.py
import sounddevice as sd
from scipy.io.wavfile import write
import joblib
import librosa
import numpy as np
import os
SAMPLE_RATE = 16000
DURATION = 2 # seconds
TMP_AUDIO_FILE = "live_test.wav"
def extract_features(file_path):
    y, sr = librosa.load(file_path, sr=SAMPLE_RATE)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.mean(mfcc.T, axis=0)
# 1. Record new voice sample
print("Say your password after the beep...")
sd.sleep(500) # Small pause
audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1, dtype='int16')
sd.wait()
write(TMP_AUDIO_FILE, SAMPLE_RATE, audio)
# 2. Extract features
features = extract_features(TMP_AUDIO_FILE).reshape(1, -1)
# 3. Load model
model = joblib.load("voice_auth_model.pkl")
# 4. Predict
prediction = model.predict(features)[0]
confidence = model.predict_proba(features).max()
print(f"\nPrediction: {prediction} (Confidence: {confidence:.2f})")
# 5. Take action
if prediction == "you_access" and confidence > 0.8:
    print("Access granted!")
    # os.system("docker exec -it secure_container bash")  # example trigger
else:
    print("Access denied.")
Let’s run the authentication script and see what happens:
➜ auth-voice-system python3 authenticate.py
Say your password after the beep...
Prediction: other_access (Confidence: 0.85)
Access denied.
➜ auth-voice-system python3 authenticate.py
Say your password after the beep...
Prediction: you_access (Confidence: 0.59)
Access denied.
➜ auth-voice-system python3 authenticate.py
Say your password after the beep...
Prediction: other_access (Confidence: 0.86)
Access denied.
➜ auth-voice-system python3 authenticate.py
Say your password after the beep...
Prediction: other_access (Confidence: 0.86)
Access denied.
➜ auth-voice-system python3 authenticate.py
Say your password after the beep...
Prediction: other_access (Confidence: 0.95)
Access denied.
➜ auth-voice-system
Hmm… not great. The model is recognizing the “you_access” label occasionally, but with low confidence, so it’s still denying access. And when I say “banana,” it correctly detects other_access, which is what we want.
So we’ve got two ways to improve this:
- Lower the confidence threshold (make the system less strict), or
- Feed the model more training data, especially for the you_access class.
Let’s go with the second option, better data means better results.
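For reference, the first option would be a one-line change: the hardcoded 0.8 in authenticate.py is the threshold. If you’d rather make it tunable without editing the file, here’s a small sketch using argparse (the flag name is my own invention):
# optional tweak to authenticate.py: a configurable threshold
import argparse

parser = argparse.ArgumentParser(description="Voice authentication")
parser.add_argument("--threshold", type=float, default=0.8,
                    help="Minimum confidence required to grant access")
args = parser.parse_args()

# ...then replace the hardcoded check with:
# if prediction == "you_access" and confidence > args.threshold: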
Recording More Samples
We’ll update record_voice_password.py to increase the sample count:
NUM_SAMPLES = 20
PHRASE = "access" # your password phrase
OUTPUT_DIR = f"voice_samples/you_{PHRASE.replace(' ', '_')}"
Once that’s done, I re-ran the script and recorded 20 samples of my voice saying “access” (the script overwrites the original 10 and adds 10 more).
Extract Features Again:
➜ auth-voice-system python3 extract_features.py
Extracted features from 30 audio samples.
Labels found: {'you_access', 'other_access'}
Saved features to voice_features.pkl
Re-Train the Model:
➜ auth-voice-system python3 train_model.py
Training on 30 samples across labels: {'you_access', 'other_access'}
Classification Report:
              precision    recall  f1-score   support

other_access       1.00      1.00      1.00         2
  you_access       1.00      1.00      1.00         4

    accuracy                           1.00         6
   macro avg       1.00      1.00      1.00         6
weighted avg       1.00      1.00      1.00         6
Confusion Matrix:
[[2 0]
[0 4]]
Model saved as voice_auth_model.pkl
Let's try the auth system again.
➜ auth-voice-system python3 authenticate.py
Say your password after the beep...
Prediction: you_access (Confidence: 1.00)
Access granted!
➜ auth-voice-system python3 authenticate.py
Say your password after the beep...
Prediction: you_access (Confidence: 0.99)
Access granted!
➜ auth-voice-system python3 authenticate.py
Say your password after the beep...
Prediction: you_access (Confidence: 0.79)
Access denied.
➜ auth-voice-system python3 authenticate.py
Say your password after the beep...
Prediction: you_access (Confidence: 0.52)
Access denied.
Much better! The model is now more confident in its predictions and is granting access most of the time, especially when I speak clearly and consistently. And it’s still denying when the confidence is too low, which is expected behavior.
Looks like it’s working!
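One more thing before wrapping up: if you’d rather call this from another script than run it standalone, the whole check fits in one function. Here’s a sketch that reuses the same model file and feature logic (the module and function names are mine, not part of the code above):
# voice_auth.py - a reusable wrapper around the same pipeline (illustrative sketch)
import sounddevice as sd
from scipy.io.wavfile import write
import joblib
import librosa
import numpy as np

SAMPLE_RATE = 16000
DURATION = 2  # seconds

def is_authorized(model_path="voice_auth_model.pkl", threshold=0.8):
    """Record a live sample and return True if it matches the password voice."""
    audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype='int16')
    sd.wait()
    write("live_test.wav", SAMPLE_RATE, audio)
    y, sr = librosa.load("live_test.wav", sr=SAMPLE_RATE)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    features = np.mean(mfcc.T, axis=0).reshape(1, -1)
    model = joblib.load(model_path)
    prediction = model.predict(features)[0]
    confidence = model.predict_proba(features).max()
    return prediction == "you_access" and confidence > threshold

if __name__ == "__main__":
    print("Access granted!" if is_authorized() else "Access denied.")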
Final Thoughts
And that’s it, we just built a working voice authentication system using Python, your own voice, and a bit of ML magic.
It’s far from production-ready, but it’s a powerful proof of concept:
- You trained a model on your own voice
- It learned to distinguish your voice from others
- And you used it to control access, just like in spy movies
You could easily extend this to:
- Unlock files or folders
- Trigger smart home devices via Raspberry Pi
- Secure local apps or servers
- Add multi-user support (more than one voice profile)
You are now one step closer to building your own “Jarvis.”
If you want to check out the full code, explore improvements, or try it yourself, I’ll be uploading everything here.
What to Try Next
If you’re feeling inspired, here are a few ideas to take this further:
- Use a better classifier: Try a CNN or use pre-trained speaker embeddings with speechbrain or torchaudio.
- Add a GUI: Wrap the whole experience in a Flask, Streamlit, or even Tkinter interface.
- Make it hardware-ready: Run it on a Raspberry Pi with a USB mic to unlock real-world things.
- Noise resilience: Add background noise to your training samples or apply noise reduction filters (see the sketch after this list).
- Spoof protection: Add logic to prevent playback attacks (e.g., use real-time liveness detection).
- Continuous learning: Let the model learn and improve with every successful unlock.
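To make the noise-resilience idea concrete, here’s a rough sketch that writes a noisy copy of each training sample (the script name, output naming, and noise level are my own choices; tune to taste):
# augment_noise.py - hypothetical helper for the noise-resilience idea above
import os
import numpy as np
import librosa
import soundfile as sf

SAMPLE_RATE = 16000
NOISE_LEVEL = 0.005  # relative amplitude of the added white noise

src_dir = "voice_samples/you_access"
for file in os.listdir(src_dir):
    if not file.endswith(".wav") or file.endswith("_noisy.wav"):
        continue  # skip non-audio files and already-augmented copies
    path = os.path.join(src_dir, file)
    y, sr = librosa.load(path, sr=SAMPLE_RATE)
    noisy = y + np.random.randn(len(y)) * NOISE_LEVEL  # add Gaussian noise
    out_path = path.replace(".wav", "_noisy.wav")
    sf.write(out_path, noisy, sr)
    print(f"Wrote {out_path}")
Because the noisy copies land in the same folder, build_dataset picks them up automatically under the same label the next time you run extract_features.py.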