Skip to content

SeboLab/Realtime-speaker-identification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Realtime-speaker-identification

What is it?

Uses pyannote and Whisper to get realtime transcriptions along with speaker lables for two users speaking through a single streaming microphone. Credit goes to this youtube video by Tech Giant. Code is largely based on his, but tweaked to work with two people at once

Setup - For Everyone

1. pip install all requirements from requirements.txt

I would recommend running on Python 3.12.0, other versions may produce conflicts

2. Create a folder called voiceprint_audio

  • Will hold the 15s audio clips of each speaker for pyannote to use to create your voiceprint

3. Create a HuggingFace account and get an access token

  • Create your account and go here to get access to pyannote models
  • Then click on your profile on the top-right and make yourself an access token with write permissions
  • Add this token to a .env file, and label it HF_API_KEY=[your HF access token]

4. Create another folder called modules

  • Within this folder, run git clone https://github.com/ggerganov/whisper.cpp.git
  • Go into the whisper.cpp folder and run bash ./models/download-ggml-model.sh base.en
  • This will download the base english model from Whisper

Setup - For Windows

5. Install Cmake at this link

  • Download the Windows 64 installer and run

6. Run the following commands in the modules folder

cd whisper.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release

How to run

1. Add your two voiceprint audio files to the voiceprint audio folder

  • Each audio file should be around 15s, and should feature only your voice in a clear non-noisy envrionment
  • Label them as [name].wav so that your speaker labels have names
  • This works for two speakers ONLY, so there must be exactly two audio files here

2. Run main.py and that's it!

  • Transcription along with speaker labels will be printed out into the terminal
  • Print statements with cosine distance will also be printed, which just tells you how likely it was that you were speaking (>0.675 means unlikely, <0.675 means likley)
  • For best performance, try not to speak at the same time

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages