
Simple audio recognition: Recognizing keywords

This tutorial demonstrates how to preprocess audio files in the WAV format and build and train a basic automatic speech recognition (ASR) model for recognizing eight different words. You will use a portion of the Speech Commands dataset (Warden, 2018), which contains short (one-second or less) audio clips of commands such as "down", "go", "left", "no", "right", "stop", "up", and "yes".

Real-world speech and audio recognition systems are complex. But, like image classification with the MNIST dataset , this tutorial should give you a basic understanding of the techniques involved.

Import necessary modules and dependencies. You'll be using tf.keras.utils.audio_dataset_from_directory (introduced in TensorFlow 2.10), which helps generate audio classification datasets from directories of .wav files. You'll also need seaborn for visualization in this tutorial.
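
A minimal set of imports consistent with the description above might look like this (a sketch; the fixed seed is an arbitrary choice for reproducibility, and seaborn and IPython.display are only used for the visualization and playback steps later on):

```python
import pathlib

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf

from IPython import display

# Set a seed for reproducibility of the experiments.
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)
```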

Import the mini Speech Commands dataset

To save time with data loading, you will be working with a smaller version of the Speech Commands dataset. The original dataset consists of over 105,000 audio files in the WAV (Waveform) audio file format of people saying 35 different words. This data was collected by Google and released under a CC BY license.

Download and extract the mini_speech_commands.zip file containing the smaller Speech Commands dataset with tf.keras.utils.get_file:
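
A sketch of the download step; the origin URL below is the public TensorFlow download bucket the tutorial pointed to at the time of writing and may change:

```python
DATASET_PATH = 'data/mini_speech_commands'

data_dir = pathlib.Path(DATASET_PATH)
if not data_dir.exists():
    tf.keras.utils.get_file(
        'mini_speech_commands.zip',
        origin='http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip',
        extract=True,
        cache_dir='.', cache_subdir='data')
```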

The dataset's audio clips are stored in eight folders corresponding to each speech command: no , yes , down , go , left , up , right , and stop :

Divided into directories this way, you can easily load the data using keras.utils.audio_dataset_from_directory .

The audio clips are 1 second or less at 16 kHz. Setting output_sequence_length=16000 pads the shorter ones to exactly 1 second (and would trim longer ones) so that they can be easily batched.
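
A sketch of the loading step; the batch size, validation split, and seed are reasonable defaults rather than required values, and data_dir comes from the download step above:

```python
train_ds, val_ds = tf.keras.utils.audio_dataset_from_directory(
    directory=data_dir,
    batch_size=64,
    validation_split=0.2,
    seed=0,
    output_sequence_length=16000,
    subset='both')

label_names = np.array(train_ds.class_names)
print("Label names:", label_names)
```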

The dataset now contains batches of audio clips and integer labels. The audio clips have a shape of (batch, samples, channels) .

This dataset only contains single channel audio, so use the tf.squeeze function to drop the extra axis:
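
For example, a small helper mapped over both datasets (a sketch; each dataset element is a batch of audio clips and labels):

```python
def squeeze(audio, labels):
    # Drop the trailing `channels` axis: (batch, samples, 1) -> (batch, samples).
    audio = tf.squeeze(audio, axis=-1)
    return audio, labels

train_ds = train_ds.map(squeeze, tf.data.AUTOTUNE)
val_ds = val_ds.map(squeeze, tf.data.AUTOTUNE)
```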

The utils.audio_dataset_from_directory function only returns up to two splits. It's a good idea to keep a test set separate from your validation set. Ideally you'd keep it in a separate directory, but in this case you can use Dataset.shard to split the validation set into two halves. Note that iterating over any shard will load all the data, and only keep its fraction.
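
A sketch of splitting the validation set into two shards:

```python
test_ds = val_ds.shard(num_shards=2, index=0)
val_ds = val_ds.shard(num_shards=2, index=1)
```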

Let's plot a few audio waveforms:


Convert waveforms to spectrograms

The waveforms in the dataset are represented in the time domain. Next, you'll transform the waveforms from time-domain signals into time-frequency-domain signals by computing the short-time Fourier transform (STFT) to convert the waveforms to spectrograms, which show frequency changes over time and can be represented as 2D images. You will feed the spectrogram images into your neural network to train the model.

A Fourier transform ( tf.signal.fft ) converts a signal to its component frequencies, but loses all time information. In comparison, STFT ( tf.signal.stft ) splits the signal into windows of time and runs a Fourier transform on each window, preserving some time information, and returning a 2D tensor that you can run standard convolutions on.

Create a utility function for converting waveforms to spectrograms (a sketch follows the notes below):

  • The waveforms need to be of the same length, so that when you convert them to spectrograms, the results have similar dimensions. This can be done by simply zero-padding the audio clips that are shorter than one second (using tf.zeros ).
  • When calling tf.signal.stft , choose the frame_length and frame_step parameters such that the generated spectrogram "image" is almost square. For more information on the STFT parameters choice, refer to this Coursera video on audio signal processing and STFT.
  • The STFT produces an array of complex numbers representing magnitude and phase. However, in this tutorial you'll only use the magnitude, which you can derive by applying tf.abs on the output of tf.signal.stft .
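
Putting these points together, a sketch of such a function; the frame_length and frame_step values are one reasonable choice that yields a roughly square spectrogram for 1-second clips:

```python
def get_spectrogram(waveform):
    # Convert the waveform to a spectrogram via the STFT.
    spectrogram = tf.signal.stft(
        waveform, frame_length=255, frame_step=128)
    # Keep only the magnitude of the STFT.
    spectrogram = tf.abs(spectrogram)
    # Add a `channels` dimension so the spectrogram can be used as
    # image-like input to convolution layers.
    spectrogram = spectrogram[..., tf.newaxis]
    return spectrogram
```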

Next, start exploring the data. Print the shapes of one example's tensorized waveform and the corresponding spectrogram, and play the original audio:


Now, define a function for displaying a spectrogram:

Plot the example's waveform over time and the corresponding spectrogram (frequencies over time):


Now, create spectrogram datasets from the audio datasets:
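
A sketch that maps the utility function over the waveform datasets created earlier:

```python
def make_spec_ds(ds):
    return ds.map(
        map_func=lambda audio, label: (get_spectrogram(audio), label),
        num_parallel_calls=tf.data.AUTOTUNE)

train_spectrogram_ds = make_spec_ds(train_ds)
val_spectrogram_ds = make_spec_ds(val_ds)
test_spectrogram_ds = make_spec_ds(test_ds)
```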

Examine the spectrograms for different examples of the dataset:


Build and train the model

Add Dataset.cache and Dataset.prefetch operations to reduce read latency while training the model:
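
For example (a sketch; the shuffle buffer size is an arbitrary value larger than this small dataset):

```python
train_spectrogram_ds = train_spectrogram_ds.cache().shuffle(10000).prefetch(tf.data.AUTOTUNE)
val_spectrogram_ds = val_spectrogram_ds.cache().prefetch(tf.data.AUTOTUNE)
test_spectrogram_ds = test_spectrogram_ds.cache().prefetch(tf.data.AUTOTUNE)
```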

For the model, you'll use a simple convolutional neural network (CNN), since you have transformed the audio files into spectrogram images.

Your tf.keras.Sequential model will use the following Keras preprocessing layers:

  • tf.keras.layers.Resizing : to downsample the input to enable the model to train faster.
  • tf.keras.layers.Normalization : to normalize each pixel in the image based on its mean and standard deviation.

For the Normalization layer, its adapt method would first need to be called on the training data in order to compute aggregate statistics (that is, the mean and the standard deviation).
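
A sketch of such a model; the filter counts, dense width, and dropout rates are one reasonable configuration rather than the only option, and the variable names carry over from the earlier snippets:

```python
# Grab one batch to determine the input shape of the spectrograms.
for example_spectrograms, example_spect_labels in train_spectrogram_ds.take(1):
    break

input_shape = example_spectrograms.shape[1:]
num_labels = len(label_names)

# Fit the state of the Normalization layer to the training spectrograms.
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(data=train_spectrogram_ds.map(map_func=lambda spec, label: spec))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=input_shape),
    tf.keras.layers.Resizing(32, 32),   # downsample for faster training
    norm_layer,
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_labels),
])

model.summary()
```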

Configure the Keras model with the Adam optimizer and the cross-entropy loss:
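
A sketch of the compilation step; from_logits=True matches the final Dense layer above, which has no softmax:

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
```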

Train the model over 10 epochs for demonstration purposes:
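
A sketch of the training call; the EarlyStopping callback is optional and its patience value is an arbitrary choice:

```python
EPOCHS = 10
history = model.fit(
    train_spectrogram_ds,
    validation_data=val_spectrogram_ds,
    epochs=EPOCHS,
    callbacks=[tf.keras.callbacks.EarlyStopping(verbose=1, patience=2)],
)
```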

Let's plot the training and validation loss curves to check how your model has improved during training:


Evaluate the model performance

Run the model on the test set and check the model's performance:

Display a confusion matrix

Use a confusion matrix to check how well the model did classifying each of the commands in the test set:
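
One possible sketch, using tf.math.confusion_matrix together with the seaborn import from earlier; it assumes the test set is not shuffled, so predictions and labels stay aligned:

```python
y_pred = tf.argmax(model.predict(test_spectrogram_ds), axis=1)
y_true = tf.concat([labels for _, labels in test_spectrogram_ds], axis=0)

confusion_mtx = tf.math.confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(confusion_mtx,
            xticklabels=label_names,
            yticklabels=label_names,
            annot=True, fmt='g')
plt.xlabel('Prediction')
plt.ylabel('Label')
plt.show()
```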


Run inference on an audio file

Finally, verify the model's prediction output using an input audio file of someone saying "no". How well does your model perform?


As the output suggests, your model should have recognized the audio command as "no".

Export the model with preprocessing

The model's not very easy to use if you have to apply those preprocessing steps before passing data to the model for inference. So build an end-to-end version:
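
One way to sketch this is to wrap the trained model in a tf.Module that accepts either a WAV filename or a batch of 16,000-sample waveforms and applies the spectrogram preprocessing itself; the structure below reuses get_spectrogram and label_names from the earlier snippets:

```python
class ExportModel(tf.Module):
    def __init__(self, model):
        self.model = model
        # Trace signatures for a string filename and a batch of waveforms.
        self.__call__.get_concrete_function(
            x=tf.TensorSpec(shape=(), dtype=tf.string))
        self.__call__.get_concrete_function(
            x=tf.TensorSpec(shape=[None, 16000], dtype=tf.float32))

    @tf.function
    def __call__(self, x):
        # If a filename is passed, read and decode the WAV file first.
        if x.dtype == tf.string:
            x = tf.io.read_file(x)
            x, _ = tf.audio.decode_wav(x, desired_channels=1, desired_samples=16000)
            x = tf.squeeze(x, axis=-1)
            x = x[tf.newaxis, :]

        x = get_spectrogram(x)
        result = self.model(x, training=False)

        class_ids = tf.argmax(result, axis=-1)
        class_names = tf.gather(label_names, class_ids)
        return {'predictions': result,
                'class_ids': class_ids,
                'class_names': class_names}

export = ExportModel(model)
```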

Test run the "export" model:

Save and reload the model; the reloaded model gives identical output:
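
A sketch of saving, reloading, and comparing outputs on a random batch of waveforms (the random input is only for illustration):

```python
tf.saved_model.save(export, "saved")
imported = tf.saved_model.load("saved")

# Both objects should produce the same class ids for the same input.
waveforms = tf.random.uniform(shape=[1, 16000], minval=-1.0, maxval=1.0)
print(export(waveforms)['class_ids'])
print(imported(waveforms)['class_ids'])
```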

This tutorial demonstrated how to carry out simple audio classification/automatic speech recognition using a convolutional neural network with TensorFlow and Python. To learn more, consider the following resources:

  • The Sound classification with YAMNet tutorial shows how to use transfer learning for audio classification.
  • The notebooks from Kaggle's TensorFlow speech recognition challenge .
  • The TensorFlow.js - Audio recognition using transfer learning codelab teaches how to build your own interactive web app for audio classification.
  • A tutorial on deep learning for music information retrieval (Choi et al., 2017) on arXiv.
  • TensorFlow also has additional support for audio data preparation and augmentation to help with your own audio-based projects.
  • Consider using the librosa library for music and audio analysis.


Audio Sampling and Sample Rate

Audio sampling, or sampling, refers to the process of converting a continuous analog audio signal into a discrete digital signal. This is achieved by taking “snapshots”, i.e., samples, of the audio signal at regular intervals.

The sampling rate refers to the number of samples per second (or Hertz) taken from a continuous signal to make a discrete or digital signal. In simpler terms, it's how many times per second an audio signal is checked and its level recorded. The higher the sample rate, the more “snapshots” and the more detailed the digital representation of the sound wave.

Common Audio Sample Rates

There are several common sample rates used in digital audio. The choice of sample rate often depends on the intended use:

  • 96 kHz and 192 kHz : These high-definition rates are reserved for professional music production and certain streaming services catering to audiophiles.
  • 48 kHz : Adopted by the film and TV industry for clear audio syncing with video.
  • 44.1 kHz : The de facto standard for music CDs and most digital audio players, balancing quality with file size.
  • 16 kHz : Strikes a balance between quality and file size, used in voice commands and speech recognition technologies. (Yes, Picovoice engines require 16 kHz.)
  • 8 kHz : This low rate is used when bandwidth is limited; it's typical of telecommunication systems like old-school phone calls.

Upsampling and Downsampling

Upsampling and downsampling are the processes of changing an audio file’s sample rate:

Upsampling : As the name suggests, upsampling is the process of increasing the sample rate. While it doesn't improve the original audio quality beyond its initial recording, it can make the audio compatible with systems that require a higher sample rate or improve the performance of certain digital audio processing effects.

Downsampling : Downsampling refers to the process of decreasing the sampling rate. This is typically done to reduce the file size of an audio signal or to make it compatible with another system. Downsampling can result in a loss of audio quality if not done correctly, as it involves the removal of data. A low-pass filter is typically employed before downsampling to prevent aliasing, the distortion that occurs when the signal reconstructed from samples differs from the original continuous signal.
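
For illustration (not part of the original article), here is a minimal Python sketch of downsampling with the librosa library, whose resampler performs band-limited (low-pass filtered) resampling; the file names are hypothetical:

```python
import librosa
import soundfile as sf

# Load the audio at its native rate (e.g. 44.1 kHz), then downsample to 16 kHz.
y, sr = librosa.load("speech_44k.wav", sr=None)
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)

sf.write("speech_16k.wav", y_16k, 16000)
```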

Why is 16 kHz Popular?

16kHz is a popular sampling rate for several reasons. First and foremost, it strikes a balance between audio quality and file size. At 16kHz, audio files are small enough to store and transmit while offering reasonable audio quality. Secondly, the human voice's most critical frequencies lie between 300Hz and 3400Hz. The Nyquist-Shannon sampling theorem states that a sampling rate of at least twice the highest frequency is required for accurate signal representation. 16kHz is more than twice 3400Hz and sufficient for processing the human voice. That’s why 16kHz has become a standard in applications using human speech and voice.

Many telephone systems and Voice over Internet Protocol (VoIP) services use 16kHz because it captures the essential range of human speech while minimizing data usage. Voice AI applications, such as virtual assistants and dictation software, often use 16kHz as it provides sufficient quality for accurate speech analysis. Even some audiobook and podcast platforms use 16kHz to reduce file sizes and make content more accessible to users with limited bandwidth or storage.

Although 16kHz is the accepted industry standard for voice AI, Picovoice Consulting works with enterprise customers to optimize the engines or audio inputs when custom solutions are needed.


Detailed Guide on Sample Rate for ASR! [2023]


  • The Fundamentals of Automatic Speech Recognition (ASR)
  • Sample Rate Explained
  • Why a Higher Sample Rate Produces Better Audio Quality
  • Limitations of a Higher Sample Rate
  • How to Choose the Right Sample Rate for ASR
  • Recommended Sample Rates for Various ASR Use Cases
  • FutureBeeAI is Here to Assist You

In our blog, we love exploring a bunch of cool AI stuff, like making computers understand speech, recognize things in pictures, create art, chat like humans, read handwriting, and much more!

Today, we're diving deep into a crucial aspect of training speech recognition models: the sample rate. We'll keep things simple and explain why it's a big deal.

By the end of this blog, you'll know exactly how to pick the perfect sample rate for your speech recognition project and why it matters so much! So, let's get started!

The Fundamentals of Automatic Speech Recognition (ASR)

Automatic Speech Recognition, or ASR for short, is a branch of artificial intelligence dedicated to the conversion of spoken words into written text. For an ASR model to effectively understand any language, it must undergo rigorous training using a substantial amount of spoken language data in that particular language.

This speech dataset comprises audio files recorded in the target language, along with their corresponding transcriptions. These audio files consist of recordings featuring human speech, and a crucial technical aspect of these files is the sample rate, along with the bit depth, format, and so on. We will discuss the other technical features in a future post.

When training our ASR model, we have two main options: utilizing open-source or off-the-shelf datasets, or creating our own custom training dataset. In the case of open-source or off-the-shelf datasets, it is essential to verify the sample rate at which the audio data was recorded. For custom speech dataset collection, it is equally vital to ensure that all audio data is recorded at the specified sample rate.

In summary, the selection of audio files with the required sample rate plays a pivotal role in the ASR training process. To gain a deeper understanding of sample rate, let's delve into its intricacies.

Sample Rate Explained

Let's dive into the concept of sample rate. In simple terms, sample rate refers to the number of audio samples captured in one second. You might also hear it called sampling frequency or sampling rate.

To measure the sample rate, we use Hertz (Hz) as the unit of measurement. Often, you'll see it expressed in kilohertz (kHz) in everyday discussions.

Now, let's visualize what the sample rate looks like on an audio graph.

Sample Rate Graph

The red line in the graph represents the sound signal, while the yellow dots scattered along it represent individual samples. Think of sample rate as a measure of how many of these samples are taken in a single second. For instance, if you have an audio file with an 8 kHz sample rate, it means that 8,000 samples are captured per second for that audio file.

Now, imagine you want to recreate the sound signal from these samples. Which scenario do you think would make it easier: having a high sample rate or a low one?

To clarify, think of the graph again. If you have more dots (samples), you can reconstruct the sound signal more accurately compared to having fewer dots. Essentially, a higher sample rate means a more detailed representation of the audio signal, allowing you to encode more information and ultimately resulting in better audio quality.

So, if you have two audio files, one with an 8 kHz sample rate and another with a 48 kHz sample rate, the 48 kHz file will generally sound much better.

Let's dive into why a higher sample rate allows for more information to be encoded.

Picture trying to capture images of a fast-moving car on a road. Your frequency of capturing images can be likened to the sample rate. If your capture frequency is too low, you'll miss important moments because the car is moving too quickly.

But if your capture frequency is high, you can capture each crucial moment, making it possible to faithfully reproduce the visual.

This same principle applies to audio. If your sample rate is low, meaning you're capturing fewer sound signals in a given time, you might miss subtle nuances in speech. Consequently, when you attempt to reproduce the audio, it won't match the original quality.

However, when you have a high enough sample rate, you capture all the nuances of speech, enabling accurate audio reproduction.

In fact, with a sufficiently high sample rate, you can reproduce audio so accurately that humans can't distinguish it from the original.

But what qualifies as a "high enough" sample rate? Does this mean that a higher sample rate is always better?

Not necessarily. Using the image analogy again, if your capture frequency is excessively high, you might end up with duplicate images. Similarly, in audio, an excessively high sample rate can capture unnecessary background noise and other irrelevant details.

To determine the right sample rate, we turn to Nyquist's theorem . This theorem suggests that to avoid aliasing and accurately capture a signal, you should sample it at a rate at least twice the highest frequency you want to capture.

For humans, our ears are sensitive to frequencies between 20 Hz and 20 kHz. Following Nyquist, the sample rate needed to cover that range would be at least 40 kHz. This is why most music CDs are recorded with sample rates of 44.1 kHz to 48 kHz, with the additional 4.1 kHz to 8 kHz serving as a buffer to prevent data loss during the analog-to-digital conversion process.

However, despite its high audio quality, a 48 kHz audio file may not be suitable for training Automatic Speech Recognition (ASR) models due to several reasons:

  • High sample rates require more computational power, making them less practical for certain applications.
  • Increased computational demands result in higher power consumption, leading to a larger carbon footprint.
  • Audio files with higher sample rates have larger file sizes, necessitating more storage space.
  • Larger file sizes also mean slower data transmission between modules.
  • As discussed earlier, a higher sample rate captures more information, which can include background noise, potentially leading to noise amplification.
  • Not all ASR systems or AI modules support high sample rates, which can limit interoperability.

Now the question is: how do we choose the optimal sample rate for an ASR system? Let's find out.

It primarily depends on the use case and the frequency range of human speech. Human speech intelligibility typically falls within the range of 300 Hz to 3400 Hz. Doubling the upper limit according to Nyquist, a sample rate of around 8000 Hz is sufficient to capture human voice accurately. This is why 8 kHz is commonly used in speech recognition systems, telecommunication channels, and codecs.

While offering adequate quality, 8 kHz also brings the advantages of lower computational cost, lower power consumption, and less data to transfer. That doesn't mean 8 kHz gives the best quality; rather, it's a sweet spot in the tradeoff between quality and these limitations.

As mentioned earlier, choosing the right sample rate also depends on the use case. Many HD voice devices use 16 kHz, as it provides more accurate high-frequency information than 8 kHz. So if you have more computational power to train your AI model, you can choose 16 kHz in place of 8 kHz.

In most cases, ASR models for voice recognition tasks often do not require sample rates exceeding 22 kHz. On the other hand, in scenarios where exceptional audio quality is essential, such as music and audio production, a sample rate of 44 kHz to 48 kHz is preferred.

For Text-to-Speech (TTS) applications, which require detailed acoustic characteristics, sample rates of 22.05 kHz, 32 kHz, 44.1 kHz, or 48 kHz are used to ensure accurate audio reproduction from text.

It should be clear by now that choosing the optimal sample rate depends on your use case. Below are some common ASR use cases and the sample rates generally used for them.

Voice Assistants (e.g., Siri, Alexa, Google Assistant):

- Optimal Sample Rate: 16 kHz to 48 kHz - These applications prioritize high-quality audio for natural language understanding. Sample rates between 16 kHz and 48 kHz are often used to capture clear and detailed voice input.

Conversational AI, Telephony, and IVR Systems:

- Optimal Sample Rate: 8 kHz - Traditional telephone systems, Interactive Voice Response (IVR) systems, and call center ASR solutions typically use an 8 kHz sample rate to match telephony standards.

Transcription Services (e.g., Speech-to-Text):

- Optimal Sample Rate: 16 kHz to 48 kHz - When transcribing spoken content for applications like transcription services, podcasts, or video captions, higher sample rates in the range of 16 kHz to 48 kHz are often preferred for accuracy.

Medical Transcription and Dictation:

- Optimal Sample Rate: 16 kHz to 48 kHz - Medical transcription and dictation applications typically benefit from higher sample rates to capture medical professionals' detailed speech accurately.

Remember that the optimal sample rate can vary based on the specific requirements and constraints of each ASR use case. It's essential to conduct testing and evaluation to determine the best sample rate for your application while considering factors like audio quality, computational resources, and the intended environment.

FutureBeeAI is Here to Assist You!

We at FutureBeeAI assist AI organizations working on any ASR use case with our extensive speech data offerings. With our pre-made datasets, including general conversation, call center conversation, and scripted monologue speech, you can scale your AI model development. All of these datasets span 40+ languages and 6+ industries. You can check out all the published speech data here.


In addition, with our state-of-the-art mobile application and global crowd community, you can collect custom speech datasets tailored to your requirements. Our data collection mobile application, Yugo, allows you to record both scripted and conversational speech data with flexible technical features such as sample rate, bit depth, file format, and audio channels. Check out our Yugo application here.

Feel free to reach out to us in case you need any help with training datasets for your ASR use cases. We would love to assist you!


A Complete Guide to Audio Datasets


*]:break-words" href="#introduction" title="Introduction"> Introduction *]:break-words" href="#contents" title="Contents"> Contents *]:break-words" href="#the-hub" title="The Hub"> The Hub *]:break-words" href="#load-an-audio-dataset" title="Load an Audio Dataset"> Load an Audio Dataset *]:break-words" href="#easy-to-load-easy-to-process" title="Easy to Load, Easy to Process"> Easy to Load, Easy to Process 1. Resampling the Audio Data 2. Pre-Processing Function 3. Filtering Function *]:break-words" href="#streaming-mode-the-silver-bullet" title="Streaming Mode: The Silver Bullet"> Streaming Mode: The Silver Bullet *]:break-words" href="#a-tour-of-audio-datasets-on-the-hub" title="A Tour of Audio Datasets on The Hub"> A Tour of Audio Datasets on The Hub English Speech Recognition Multilingual Speech Recognition Speech Translation Audio Classification *]:break-words" href="#closing-remarks" title="Closing Remarks"> Closing Remarks Introduction

🤗 Datasets is an open-source library for downloading and preparing datasets from all domains. Its minimalistic API allows users to download and prepare datasets in just one line of Python code, with a suite of functions that enable efficient pre-processing. The number of datasets available is unparalleled, with all the most popular machine learning datasets available to download.

Not only this, but 🤗 Datasets comes prepared with multiple audio-specific features that make working with audio datasets easy for researchers and practitioners alike. In this blog, we'll demonstrate these features, showcasing why 🤗 Datasets is the go-to place for downloading and preparing audio datasets.

The Hub

The Hugging Face Hub is a platform for hosting models, datasets and demos, all open source and publicly available. It is home to a growing collection of audio datasets that span a variety of domains, tasks and languages. Through tight integrations with 🤗 Datasets, all the datasets on the Hub can be downloaded in one line of code.

Let's head to the Hub and filter the datasets by task:

  • Speech Recognition Datasets on the Hub
  • Audio Classification Datasets on the Hub


At the time of writing, there are 77 speech recognition datasets and 28 audio classification datasets on the Hub, with these numbers ever-increasing. You can select any one of these datasets to suit your needs. Let's check out the first speech recognition result. Clicking on common_voice brings up the dataset card:


Here, we can find additional information about the dataset, see what models are trained on the dataset and, most excitingly, listen to actual audio samples. The Dataset Preview is presented in the middle of the dataset card. It shows us the first 100 samples for each subset and split. What's more, it's loaded up the audio samples ready for us to listen to in real-time. If we hit the play button on the first sample, we can listen to the audio and see the corresponding text.

The Dataset Preview is a brilliant way of experiencing audio datasets before committing to using them. You can pick any dataset on the Hub, scroll through the samples and listen to the audio for the different subsets and splits, gauging whether it's the right dataset for your needs. Once you've selected a dataset, it's trivial to load the data so that you can start using it.

Load an Audio Dataset

One of the key defining features of 🤗 Datasets is the ability to download and prepare a dataset in just one line of Python code. This is made possible through the load_dataset function. Conventionally, loading a dataset involves: i) downloading the raw data, ii) extracting it from its compressed format, and iii) preparing individual samples and splits. Using load_dataset , all of the heavy lifting is done under the hood.

Let's take the example of loading the GigaSpeech dataset from Speech Colab. GigaSpeech is a relatively recent speech recognition dataset for benchmarking academic speech systems and is one of many audio datasets available on the Hugging Face Hub.

To load the GigaSpeech dataset, we simply take the dataset's identifier on the Hub ( speechcolab/gigaspeech ) and specify it to the load_dataset function. GigaSpeech comes in five configurations of increasing size, ranging from xs (10 hours) to xl (10,000 hours). For the purpose of this tutorial, we'll load the smallest of these configurations. The dataset's identifier and the desired configuration are all that we require to download the dataset:
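
A sketch of that one-liner (GigaSpeech is a gated dataset, so you may first need to accept its terms on the Hub and authenticate, for example with huggingface-cli login):

```python
from datasets import load_dataset

# "xs" is the smallest configuration (10 hours of training data).
gigaspeech = load_dataset("speechcolab/gigaspeech", "xs")

print(gigaspeech)
```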


And just like that, we have the GigaSpeech dataset ready! There simply is no easier way of loading an audio dataset. We can see that we have the training, validation and test splits pre-partitioned, with the corresponding information for each.

The object gigaspeech returned by the load_dataset function is a DatasetDict . We can treat it in much the same way as an ordinary Python dictionary. To get the train split, we pass the corresponding key to the gigaspeech dictionary:

This returns a Dataset object, which contains the data for the training split. We can go one level deeper and get the first item of the split. Again, this is possible through standard Python indexing:

We can see that there are a number of features returned by the training split, including segment_id , speaker , text , audio and more. For speech recognition, we'll be concerned with the text and audio columns.

Using 🤗 Datasets' remove_columns method, we can remove the dataset features not required for speech recognition:
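
One way to sketch this, keeping only the two columns we need:

```python
COLUMNS_TO_KEEP = ["text", "audio"]

all_columns = gigaspeech["train"].column_names
columns_to_remove = set(all_columns) - set(COLUMNS_TO_KEEP)

gigaspeech = gigaspeech.remove_columns(columns_to_remove)
```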

Let's check that we've successfully retained the text and audio columns:

Great! We can see that we've got the two required columns text and audio . The text is a string with the sample transcription and the audio a 1-dimensional array of amplitude values at a sampling rate of 16 kHz. That's our dataset loaded!

Easy to Load, Easy to Process

Loading a dataset with 🤗 Datasets is just half of the fun. We can now use the suite of tools available to efficiently pre-process our data ready for model training or inference. In this section, we'll perform three stages of data pre-processing:

  • Resampling the Audio Data
  • Pre-Processing Function
  • Filtering Function

1. Resampling the Audio Data

The load_dataset function prepares audio samples with the sampling rate that they were published with. This is not always the sampling rate expected by our model. In this case, we need to resample the audio to the correct sampling rate.

We can set the audio inputs to our desired sampling rate using 🤗 Datasets' cast_column method. This operation does not change the audio in-place, but rather signals to datasets to resample the audio samples on the fly when they are loaded. The following code cell will set the sampling rate to 8kHz:
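
A sketch of the cast, using the Audio feature type:

```python
from datasets import Audio

gigaspeech = gigaspeech.cast_column("audio", Audio(sampling_rate=8000))
```

Casting back to 16 kHz later is the same call with sampling_rate=16000.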

Re-loading the first audio sample in the GigaSpeech dataset will resample it to the desired sampling rate:

We can see that the sampling rate has been downsampled to 8kHz. The array values are also different, as we've now only got approximately one amplitude value for every two that we had before. Let's set the dataset sampling rate back to 16kHz, the sampling rate expected by most speech recognition models:

Easy! cast_column provides a straightforward mechanism for resampling audio datasets as and when required.

2. Pre-Processing Function

One of the most challenging aspects of working with audio datasets is preparing the data in the right format for our model. Using 🤗 Datasets' map method, we can write a function to pre-process a single sample of the dataset, and then apply it to every sample without any code changes.

First, let's load a processor object from 🤗 Transformers. This processor pre-processes the audio to input features and tokenises the target text to labels. The AutoProcessor class is used to load a processor from a given model checkpoint. In the example, we load the processor from OpenAI's Whisper medium.en checkpoint, but you can change this to any model identifier on the Hugging Face Hub:
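
A sketch of loading the processor from that checkpoint:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-medium.en")
```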

Great! Now we can write a function that takes a single training sample and passes it through the processor to prepare it for our model. We'll also compute the input length of each audio sample, information that we'll need for the next data preparation step:
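
Roughly, such a function could look like this for the Whisper processor loaded above (a sketch; the input_length bookkeeping feeds the filtering step later):

```python
def prepare_dataset(batch):
    audio = batch["audio"]

    # Compute model input features and tokenise the transcription to label ids.
    batch = processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["text"])

    # Record the audio length in seconds for the filtering step.
    batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]
    return batch
```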

We can apply the data preparation function to all of our training examples using 🤗 Datasets' map method. Here, we also remove the text and audio columns, since we have pre-processed the audio to input features and tokenised the text to labels:
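
A sketch of the map call, dropping the original columns in the same step:

```python
gigaspeech = gigaspeech.map(
    prepare_dataset, remove_columns=gigaspeech["train"].column_names)
```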

3. Filtering Function

Prior to training, we might have a heuristic for filtering our training data. For instance, we might want to filter any audio samples longer than 30s to prevent truncating the audio samples or risking out-of-memory errors. We can do this in much the same way that we prepared the data for our model in the previous step.

We start by writing a function that indicates which samples to keep and which to discard. This function, is_audio_length_in_range , returns a boolean: samples that are shorter than 30s return True, and those that are longer False.

We can apply this filtering function to all of our training examples using 🤗 Datasets' filter method, keeping all samples that are shorter than 30s (True) and discarding those that are longer (False):
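
A sketch of both pieces, filtering on the input_length column computed in prepare_dataset:

```python
MAX_DURATION_IN_SECONDS = 30.0

def is_audio_length_in_range(input_length):
    return input_length < MAX_DURATION_IN_SECONDS

gigaspeech["train"] = gigaspeech["train"].filter(
    is_audio_length_in_range, input_columns=["input_length"])
```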

And with that, we have the GigaSpeech dataset fully prepared for our model! In total, this process required 13 lines of Python code, right from loading the dataset to the final filtering step.

Keeping the notebook as general as possible, we only performed the fundamental data preparation steps. However, there is no restriction to the functions you can apply to your audio dataset. You can extend the function prepare_dataset to perform much more involved operations, such as data augmentation, voice activity detection or noise reduction. With 🤗 Datasets, if you can write it in a Python function, you can apply it to your dataset!

Streaming Mode: The Silver Bullet

One of the biggest challenges faced with audio datasets is their sheer size. The xs configuration of GigaSpeech contained just 10 hours of training data, but amassed over 13GB of storage space for download and preparation. So what happens when we want to train on a larger split? The full xl configuration contains 10,000 hours of training data, requiring over 1TB of storage space. For most speech researchers, this well exceeds the specifications of a typical hard drive. Do we need to fork out and buy additional storage? Or is there a way we can train on these datasets with no disk space constraints?

🤗 Datasets allows us to do just this. It is made possible through the use of streaming mode, depicted graphically in Figure 1. Streaming allows us to load the data progressively as we iterate over the dataset. Rather than downloading the whole dataset at once, we load the dataset sample by sample. We iterate over the dataset, loading and preparing samples on the fly when they are needed. This way, we only ever load the samples that we're using, and not the ones that we're not! Once we're done with a sample, we continue iterating over the dataset and load the next one.

This is analogous to downloading a TV show versus streaming it. When we download a TV show, we download the entire video offline and save it to our disk. We have to wait for the entire video to download before we can watch it and require as much disk space as size of the video file. Compare this to streaming a TV show. Here, we don’t download any part of the video to disk, but rather iterate over the remote video file and load each part in real-time as required. We don't have to wait for the full video to buffer before we can start watching, we can start as soon as the first portion of the video is ready! This is the same streaming principle that we apply to loading datasets.

Streaming mode has three primary advantages over downloading the entire dataset at once:

  • Disk space: samples are loaded to memory one-by-one as we iterate over the dataset. Since the data is not downloaded locally, there are no disk space requirements, so you can use datasets of arbitrary size.
  • Download and processing time: audio datasets are large and need a significant amount of time to download and process. With streaming, loading and processing is done on the fly, meaning you can start using the dataset as soon as the first sample is ready.
  • Easy experimentation: you can experiment on a handful of samples to check that your script works without having to download the entire dataset.

There is one caveat to streaming mode. When downloading a dataset, both the raw data and processed data are saved locally to disk. If we want to re-use this dataset, we can directly load the processed data from disk, skipping the download and processing steps. Consequently, we only have to perform the downloading and processing operations once, after which we can re-use the prepared data. With streaming mode, the data is not downloaded to disk. Thus, neither the downloaded nor pre-processed data are cached. If we want to re-use the dataset, the streaming steps must be repeated, with the audio files loaded and processed on the fly again. For this reason, it is advised to download datasets that you are likely to use multiple times.

How can you enable streaming mode? Easy! Just set streaming=True when you load your dataset. The rest will be taken care for you:
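
For example, re-using the GigaSpeech identifier from earlier (a sketch):

```python
gigaspeech = load_dataset("speechcolab/gigaspeech", "xs", streaming=True)

# With streaming, samples are loaded lazily: iterate instead of indexing.
print(next(iter(gigaspeech["train"])))
```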

All the steps covered so far in this tutorial can be applied to the streaming dataset without any code changes. The only change is that you can no longer access individual samples using Python indexing (i.e. gigaspeech["train"][sample_idx] ). Instead, you have to iterate over the dataset, using a for loop for example.

Streaming mode can take your research to the next level: not only are the biggest datasets accessible to you, but you can easily evaluate systems over multiple datasets in one go without worrying about your disk space. Compared to evaluating on a single dataset, multi-dataset evaluation gives a better metric for the generalisation abilities of a speech recognition system ( c.f. End-to-end Speech Benchmark (ESB) ). The accompanying Google Colab provides an example for evaluating the Whisper model on eight English speech recognition datasets in one script using streaming mode.

A Tour of Audio Datasets on The Hub

This Section serves as a reference guide for the most popular speech recognition, speech translation and audio classification datasets on the Hugging Face Hub. We can apply everything that we've covered for the GigaSpeech dataset to any of the datasets on the Hub. All we have to do is switch the dataset identifier in the load_dataset function. It's that easy!

English Speech Recognition


Speech recognition, or speech-to-text, is the task of mapping from spoken speech to written text, where both the speech and text are in the same language. We provide a summary of the most popular English speech recognition datasets on the Hub:

Dataset | Domain | Speaking Style | Train Hours | License | Recommended Use
LibriSpeech | Audiobook | Narrated | 960 | CC-BY-4.0 | Academic benchmarks
Common Voice 11 | Wikipedia | Narrated | 2300 | CC0-1.0 | Non-native speakers
VoxPopuli | European Parliament | Oratory | 540 | CC0 | Non-native speakers
TED-LIUM | TED talks | Oratory | 450 | CC-BY-NC-ND 3.0 | Technical topics
GigaSpeech | Audiobook, podcast, YouTube | Narrated, spontaneous | 10000 | apache-2.0 | Robustness over multiple domains
SPGISpeech | Financial meetings | Oratory, spontaneous | 5000 | User Agreement | Fully formatted transcriptions
Earnings-22 | Financial meetings | Oratory, spontaneous | 119 | CC-BY-SA-4.0 | Diversity of accents
AMI | Meetings | Spontaneous | 100 | CC-BY-4.0 | Noisy speech conditions

Refer to the Google Colab for a guide on evaluating a system on all eight English speech recognition datasets in one script.

The following dataset descriptions are largely taken from the ESB Benchmark paper.

LibriSpeech ASR

LibriSpeech is a standard large-scale dataset for evaluating ASR systems. It consists of approximately 1,000 hours of narrated audiobooks collected from the LibriVox project. LibriSpeech has been instrumental in facilitating researchers to leverage a large body of pre-existing transcribed speech data. As such, it has become one of the most popular datasets for benchmarking academic speech systems.

Common Voice

Common Voice is a series of crowd-sourced open-licensed speech datasets where speakers record text from Wikipedia in various languages. Since anyone can contribute recordings, there is significant variation in both audio quality and speakers. The audio conditions are challenging, with recording artefacts, accented speech, hesitations, and the presence of foreign words. The transcriptions are both cased and punctuated. The English subset of version 11.0 contains approximately 2,300 hours of validated data. Use of the dataset requires you to agree to the Common Voice terms of use, which can be found on the Hugging Face Hub: mozilla-foundation/common_voice_11_0 . Once you have agreed to the terms of use, you will be granted access to the dataset. You will then need to provide an authentication token from the Hub when you load the dataset.

VoxPopuli is a large-scale multilingual speech corpus consisting of data sourced from 2009-2020 European Parliament event recordings. Consequently, it occupies the unique domain of oratory, political speech, largely sourced from non-native speakers. The English subset contains approximately 550 hours of labelled speech.

TED-LIUM is a dataset based on English-language TED Talk conference videos. The speaking style is oratory educational talks. The transcribed talks cover a range of different cultural, political, and academic topics, resulting in a technical vocabulary. The Release 3 (latest) edition of the dataset contains approximately 450 hours of training data. The validation and test data are from the legacy set, consistent with earlier releases.

GigaSpeech is a multi-domain English speech recognition corpus curated from audiobooks, podcasts and YouTube. It covers both narrated and spontaneous speech over a variety of topics, such as arts, science and sports. It contains training splits varying from 10 hours - 10,000 hours and standardised validation and test splits.

SPGISpeech is an English speech recognition corpus composed of company earnings calls that have been manually transcribed by S&P Global, Inc. The transcriptions are fully-formatted according to a professional style guide for oratory and spontaneous speech. It contains training splits ranging from 200 hours - 5,000 hours, with canonical validation and test splits.

Earnings-22

Earnings-22 is a 119-hour corpus of English-language earnings calls collected from global companies. The dataset was developed with the goal of aggregating a broad range of speakers and accents covering a range of real-world financial topics. There is large diversity in the speakers and accents, with speakers taken from seven different language regions. Earnings-22 was published primarily as a test-only dataset. The Hub contains a version of the dataset that has been partitioned into train-validation-test splits.

AMI comprises 100 hours of meeting recordings captured using different recording streams. The corpus contains manually annotated orthographic transcriptions of the meetings aligned at the word level. Individual samples of the AMI dataset contain very large audio files (between 10 and 60 minutes), which are segmented to lengths feasible for training most speech recognition systems. AMI contains two splits: IHM and SDM. IHM (individual headset microphone) contains easier near-field speech, and SDM (single distant microphone) harder far-field speech.

Multilingual Speech Recognition

Multilingual speech recognition refers to speech recognition (speech-to-text) for all languages except English.

Multilingual LibriSpeech

Multilingual LibriSpeech is the multilingual equivalent of the LibriSpeech ASR corpus. It comprises a large corpus of read audiobooks taken from the LibriVox project, making it a suitable dataset for academic research. It contains data split into eight high-resource languages - English, German, Dutch, Spanish, French, Italian, Portuguese and Polish.

Common Voice is a series of crowd-sourced open-licensed speech datasets where speakers record text from Wikipedia in various languages. Since anyone can contribute recordings, there is significant variation in both audio quality and speakers. The audio conditions are challenging, with recording artefacts, accented speech, hesitations, and the presence of foreign words. The transcriptions are both cased and punctuated. As of version 11, there are over 100 languages available, both low and high-resource.

VoxPopuli is a large-scale multilingual speech corpus consisting of data sourced from 2009-2020 European Parliament event recordings. Consequently, it occupies the unique domain of oratory, political speech, largely sourced from non-native speakers. It contains labelled audio-transcription data for 15 European languages.

FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a dataset for evaluating speech recognition systems in 102 languages, including many that are classified as 'low-resource'. The data is derived from the FLoRes-101 dataset, a machine translation corpus with 3001 sentence translations from English to 101 other languages. Native speakers are recorded narrating the sentence transcriptions in their native language. The recorded audio data is paired with the sentence transcriptions to yield multilingual speech recognition over all 101 languages. The training sets contain approximately 10 hours of supervised audio-transcription data per language.

Speech Translation

Speech translation is the task of mapping from spoken speech to written text, where the speech and text are in different languages (e.g. English speech to French text).

CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset is created using Mozilla's open-source Common Voice database of crowd-sourced voice recordings. There are 2,900 hours of speech represented in the corpus.

FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a dataset for evaluating speech recognition systems in 102 languages, including many that are classified as 'low-resource'. The data is derived from the FLoRes-101 dataset, a machine translation corpus with 3001 sentence translations from English to 101 other languages. Native speakers are recorded narrating the sentence transcriptions in their native languages. An n-way parallel corpus of speech translation data is constructed by pairing the recorded audio data with the sentence transcriptions for each of the 101 languages. The training sets contain approximately 10 hours of supervised audio-transcription data per source-target language combination.

Audio Classification

Audio classification is the task of mapping a raw audio input to a class label output. Practical applications of audio classification include keyword spotting, speaker intent and language identification.

SpeechCommands

SpeechCommands is a dataset comprised of one-second audio files, each containing either a single spoken word in English or background noise. The words are taken from a small set of commands and are spoken by a number of different speakers. The dataset is designed to help train and evaluate small on-device keyword spotting systems.

Multilingual Spoken Words

Multilingual Spoken Words is a large-scale corpus of one-second audio samples, each containing a single spoken word. The dataset consists of 50 languages and more than 340,000 keywords, totalling 23.4 million one-second spoken examples or over 6,000 hours of audio. The audio-transcription data is sourced from the Mozilla Common Voice project. Time stamps are generated for every utterance on the word-level and used to extract individual spoken words and their corresponding transcriptions, thus forming a new corpus of single spoken words. The dataset's intended use is academic research and commercial applications in multilingual keyword spotting and spoken term search.

FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) is a dataset for evaluating speech recognition systems in 102 languages, including many that are classified as 'low-resource'. The data is derived from the FLoRes-101 dataset, a machine translation corpus with 3001 sentence translations from English to 101 other languages. Native speakers are recorded narrating the sentence transcriptions in their native languages. The recorded audio data is paired with a label for the language in which it is spoken. The dataset can be used as an audio classification dataset for language identification : systems are trained to predict the language of each utterance in the corpus.

In this blog post, we explored the Hugging Face Hub and experienced the Dataset Preview, an effective means of listening to audio datasets before downloading them. We loaded an audio dataset with one line of Python code and performed a series of generic pre-processing steps to prepare it for a machine learning model. In total, this required just 13 lines of code, relying on simple Python functions to perform the necessary operations. We introduced streaming mode, a method for loading and preparing samples of audio data on the fly. We concluded by summarising the most popular speech recognition, speech translation and audio classification datasets on the Hub.

Having read this blog, we hope you agree that 🤗 Datasets is the number one place for downloading and preparing audio datasets. 🤗 Datasets is made possible through the work of the community. If you would like to contribute a dataset, refer to the Guide for Adding a New Dataset .

Thank you to the following individuals who helped contribute to the blog post: Vaibhav Srivastav, Polina Kazakova, Patrick von Platen, Omar Sanseviero and Quentin Lhoest.


Speech Recognition with Wav2Vec2

Author : Moto Hira

This tutorial shows how to perform speech recognition using pre-trained models from wav2vec 2.0 [paper].

The process of speech recognition looks like the following:

  1. Extract the acoustic features from the audio waveform.
  2. Estimate the class of the acoustic features frame-by-frame.
  3. Generate a hypothesis from the sequence of the class probabilities.

Torchaudio provides easy access to the pre-trained weights and associated information, such as the expected sample rate and class labels. They are bundled together and available under torchaudio.pipelines module.

Preparation

Creating a pipeline

First, we will create a Wav2Vec2 model that performs the feature extraction and the classification.

There are two types of Wav2Vec2 pre-trained weights available in torchaudio: those fine-tuned for the ASR task, and those that are not fine-tuned.

Wav2Vec2 (and HuBERT) models are trained in a self-supervised manner. They are first trained on audio alone for representation learning, then fine-tuned for a specific task with additional labels.

The pre-trained weights without fine-tuning can be fine-tuned for other downstream tasks as well, but this tutorial does not cover that.

We will use torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H here.

There are multiple pre-trained models available in torchaudio.pipelines . Please check the documentation for the detail of how they are trained.

The bundle object provides the interface to instantiate the model and other information. The sampling rate and the class labels are found as follows.

The model can be constructed as follows. This process automatically fetches the pre-trained weights and loads them into the model.
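
A sketch of these steps; the CPU/GPU device selection is an optional convenience:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H

print("Sample rate:", bundle.sample_rate)   # 16000
print("Labels:", bundle.get_labels())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = bundle.get_model().to(device)
```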

Loading data

We will use the speech data from the VOiCES dataset , which is licensed under Creative Commons BY 4.0.

To load data, we use torchaudio.load() .

If the sampling rate is different from what the pipeline expects, then we can use torchaudio.functional.resample() for resampling.

torchaudio.functional.resample() works on CUDA tensors as well.

When performing resampling multiple times on the same set of sample rates, using torchaudio.transforms.Resample might improve the performance.
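
A sketch of loading and, if necessary, resampling; the file path is a placeholder for the VOiCES sample used in the tutorial:

```python
SPEECH_FILE = "speech.wav"  # hypothetical path to a WAV file

waveform, sample_rate = torchaudio.load(SPEECH_FILE)
waveform = waveform.to(device)

if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
```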

Extracting acoustic features

The next step is to extract acoustic features from the audio.

Wav2Vec2 models fine-tuned for ASR task can perform feature extraction and classification with one step, but for the sake of the tutorial, we also show how to perform feature extraction here.
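
A sketch of the feature-extraction call:

```python
with torch.inference_mode():
    features, _ = model.extract_features(waveform)
```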

The returned features are a list of tensors. Each tensor is the output of a transformer layer.

[Figure: features from transformer layers 1-12]

Feature classification

Once the acoustic features are extracted, the next step is to classify them into a set of categories.

The Wav2Vec2 model provides a method to perform the feature extraction and classification in one step.
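
A sketch of that single step, which returns per-frame emissions:

```python
with torch.inference_mode():
    emission, _ = model(waveform)
```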

The output is in the form of logits. It is not in the form of probability.

Let’s visualize this.

[Figure: classification result]

We can see that there are strong indications for certain labels across the timeline.

Generating transcripts

From the sequence of label probabilities, now we want to generate transcripts. The process to generate hypotheses is often called “decoding”.

Decoding is more elaborate than simple classification because decoding at certain time step can be affected by surrounding observations.

For example, take words like night and knight . Even if their prior probability distributions are different (in typical conversations, night would occur far more often than knight ), to accurately generate transcripts with knight , such as a knight with a sword , the decoding process has to postpone the final decision until it sees enough context.

There are many decoding techniques proposed, and they require external resources, such as word dictionary and language models.

In this tutorial, for the sake of simplicity, we will perform greedy decoding, which does not depend on such external components, and simply pick the best hypothesis at each time step. Therefore, the context information is not used, and only one transcript can be generated.

We start by defining a greedy decoding algorithm.
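
A sketch of a greedy CTC decoder; it assumes the blank token sits at index 0, which is the case for the bundled labels used here:

```python
class GreedyCTCDecoder(torch.nn.Module):
    def __init__(self, labels, blank=0):
        super().__init__()
        self.labels = labels
        self.blank = blank

    def forward(self, emission: torch.Tensor) -> str:
        """Given a sequence emission over labels, return the best-path transcript."""
        indices = torch.argmax(emission, dim=-1)          # [num_frames]
        indices = torch.unique_consecutive(indices, dim=-1)
        indices = [i for i in indices if i != self.blank]
        return "".join([self.labels[i] for i in indices])
```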

Now create the decoder object and decode the transcript.
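
A sketch using the labels from the bundle; emission comes from the classification step above:

```python
decoder = GreedyCTCDecoder(labels=bundle.get_labels())
transcript = decoder(emission[0])

print(transcript)
```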

Let’s check the result and listen again to the audio.

The ASR model is fine-tuned using a loss function called Connectionist Temporal Classification (CTC). The detail of CTC loss is explained here . In CTC a blank token (ϵ) is a special token which represents a repetition of the previous symbol. In decoding, these are simply ignored.

Conclusion

In this tutorial, we looked at how to use Wav2Vec2ASRBundle to perform acoustic feature extraction and speech recognition. Constructing a model and getting the emission is as short as two lines.


What is the best sample rate for the Google Speech API? Could any Google employee or expert comment on this?

So far I have tested a very small audio file at 16 kHz and 48 kHz. I would love to conduct much bigger tests, but as you know, that costs money.

The 48 kHz sample rate provided better results. However, the documentation says that 16 kHz is best.

So I am a bit confused.

Here are the 16 kHz and 48 kHz FLAC files I used to test with the Google Speech-to-Text API:

16 kHz : https://drive.google.com/file/d/1MbiW3t86W68ZqENtDqD4XdNmEV7QZbZA/view?usp=sharing

48 kHz : https://drive.google.com/file/d/1aLN1ptMJBwuYc6FdAk6CxcK1Ex4jI3vh/view?usp=sharing

And here are the produced transcripts.

The original sample rate of the video is 48 kHz.

So, can any expert or employee comment on this?

These are the 16 kHz and 48 kHz ffmpeg commands I used to compose the FLAC files.
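The exact commands from the question are not reproduced here; the following is a hypothetical reconstruction (file names are placeholders) of how 16 kHz and 48 kHz mono FLAC files might be produced with ffmpeg:

```python
import subprocess

# Hypothetical reconstruction; the question's actual ffmpeg invocations are not shown.
for rate, out_name in [(16000, "audio_16k.flac"), (48000, "audio_48k.flac")]:
    subprocess.run(
        ["ffmpeg", "-i", "input_video.mp4",   # placeholder input file
         "-vn",                                # drop the video stream
         "-ac", "1",                           # mono
         "-ar", str(rate),                     # target sample rate
         out_name],
        check=True,
    )
```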

2 Answers

16 kHz is just the recommended sample rate to use when transcribing with Speech-to-Text:

We recommend a sample rate of at least 16 kHz in the audio files that you use for transcription with Speech-to-Text. Sample rates found in audio files are typically 16 kHz, 32 kHz, 44.1 kHz, and 48 kHz. Because intelligibility is greatly affected by the frequency range, especially in the higher frequencies, a sample rate of less than 16 kHz results in an audio file that has little or no information above 8 kHz. This can prevent Speech-to-Text from correctly transcribing spoken audio. Speech intelligibility requires information throughout the 2 kHz to 4 kHz range, although the harmonics (multiples) of those frequencies in the higher range are also important for preserving speech intelligibility. Therefore, keeping the sample rate to a minimum of 16 kHz is a good practice.


16 kHz is the minimum. Downsampling loses data, so if your original is 48 kHz, it's best to keep it.


DeepSpeech for Dummies - A Tutorial and Overview

What is DeepSpeech and how does it work? This post shows basic examples of how to use DeepSpeech for asynchronous and real time transcription.


What is DeepSpeech? DeepSpeech is a neural network architecture first published by a research team at Baidu. In 2017, Mozilla created an open source implementation of this paper, dubbed “Mozilla DeepSpeech”.

The original DeepSpeech paper from Baidu popularized the concept of “end-to-end” speech recognition models. “End-to-end” means that the model takes in audio and directly outputs characters or words. This is in contrast to traditional speech recognition models, like those built with popular open source libraries such as Kaldi or CMU Sphinx, which predict phonemes and then convert those phonemes to words in a later, downstream process.

The goal of “end-to-end” models, like DeepSpeech, was to simplify the speech recognition pipeline into a single model. In addition, the theory introduced by the Baidu research paper was that training large deep learning models, on large amounts of data, would yield better performance than classical speech recognition models.

Today, the Mozilla DeepSpeech library offers pre-trained speech recognition models that you can build with, as well as tools to train your own DeepSpeech models. Another cool feature is the ability to contribute to DeepSpeech’s public training dataset through the Common Voice project.

In the below tutorial, we’re going to walk you through installing and transcribing audio files with the Mozilla DeepSpeech library (which we’ll just refer to as DeepSpeech going forward).

Basic DeepSpeech Example

DeepSpeech is easy to get started with. As discussed in our overview of Python Speech Recognition in 2021, you can download and get started with DeepSpeech using Python's package installer, pip. If you have cURL installed, you can also download DeepSpeech's pre-trained English model files from the DeepSpeech GitHub repo. Notice that the files we're downloading below are the '.scorer' and '.pbmm' files.

A quick heads up - when using DeepSpeech, it is important to consider that only 16 kilohertz (kHz) .wav files are supported as of late September 2021.

Let's go through some example code on how to asynchronously transcribe speech with DeepSpeech. If you're using a Linux distribution, you'll need to install Sound eXchange (sox); it can be installed using either 'apt' on Ubuntu/Debian or 'dnf' on Fedora, as shown below.

Now let's also install the Python libraries we'll need to get this to work. We're going to need the DeepSpeech library, webrtcvad for voice activity detection, and pyqt5 for accessing multimedia (sound) capabilities on desktop systems. We already installed DeepSpeech earlier; we can install the other two libraries with pip like so:

Now that we have all of our dependencies, let’s create a transcriber. When we’re finished, we will be able to transcribe any ‘.wav’ audio file just like the example shown below.


Before we get started on building our transcriber, make sure the model files we downloaded earlier are saved in the ‘./models’ directory of the working directory. The first thing we’re going to do is create a voice activity detection (VAD) function and use that to extract the parts of the audio file that have voice activity.

How can we create a VAD function? We’re going to need a function to read in the ‘.wav’ file, a way to generate frames of audio, and a way to create a buffer to collect the parts of the audio that have voice activity. Frames of audio are objects that we construct that contain the byte data of the audio, the timestamp in the total audio, and the duration of the frame. Let’s start by creating our wav file reader function.

All we need to do is open the given file, assert that the channels, sample width, and sample rate are what we need, and finally get the frames and return the data as PCM data along with the sample rate and duration. We'll use 'contextlib' to open, read, and close the wav file.

We're expecting audio files with 1 channel, a sample width of 2 bytes, and a sample rate of 8000, 16000, or 32000 Hz. We calculate the duration as the number of frames divided by the sample rate.
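A sketch of such a reader, in the spirit of the widely used py-webrtcvad example code (names are illustrative):

```python
import contextlib
import wave


def read_wave(path):
    """Read a .wav file; return (PCM data, sample rate, duration in seconds)."""
    with contextlib.closing(wave.open(path, "rb")) as wf:
        assert wf.getnchannels() == 1           # mono only
        assert wf.getsampwidth() == 2           # 16-bit samples
        sample_rate = wf.getframerate()
        assert sample_rate in (8000, 16000, 32000)
        frames = wf.getnframes()
        pcm_data = wf.readframes(frames)
        duration = frames / sample_rate
        return pcm_data, sample_rate, duration
```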

Now that we have a way to read in the wav file, let’s create a frame generator to generate individual frames containing the size, timestamp, and duration of a frame. We’re going to generate frames in order to ensure that our audio is processed in reasonably sized clips and to separate out segments with and without speech.


The below generator function takes the frame duration in milliseconds, the PCM audio data, and the sample rate as inputs. It uses that data to create an offset starting at 0, a frame size, and a duration. While we have not yet produced enough frames to cover the entire audio file, the function will continue to yield frames and add to our timestamp and offset.
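A sketch of the frame object and generator described above (again modeled on the py-webrtcvad example; 2 bytes per sample because the audio is 16-bit PCM):

```python
class Frame:
    """A segment of PCM audio bytes plus its timestamp and duration (seconds)."""
    def __init__(self, bytes, timestamp, duration):
        self.bytes = bytes
        self.timestamp = timestamp
        self.duration = duration


def frame_generator(frame_duration_ms, audio, sample_rate):
    """Yield successive Frames of frame_duration_ms from raw 16-bit PCM audio."""
    n = int(sample_rate * (frame_duration_ms / 1000.0) * 2)   # bytes per frame
    offset = 0
    timestamp = 0.0
    duration = (float(n) / sample_rate) / 2.0                 # seconds per frame
    while offset + n <= len(audio):
        yield Frame(audio[offset:offset + n], timestamp, duration)
        timestamp += duration
        offset += n
```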

After being able to generate frames of audio, we’ll create a function called vad_collector to separate out the parts of audio with and without speech. This function requires an input of the sample rate, the frame duration in milliseconds, the padding duration in milliseconds, a webrtcvad.Vad object, and a collection of audio frames. This function, although not explicitly called as such, is also a generator function that generates a series of PCM audio data.

The first thing we’re going to do in this function is get the number of padding frames and create a ring buffer with a dequeue. Ring buffers are most commonly used for buffering data streams.

We’ll have two states, triggered and not triggered, to indicate whether or not the VAD collector function should be adding frames to the list of voiced frames or yielding that list in bytes.

Starting with an empty list of voiced frames and a not triggered state, we loop through each frame. If we are not in a triggered state, and the frame is decided to be speech, then we add it to the buffer. If after this addition of the new frame to the buffer more than 90% of the buffer is decided to be speech, we enter the triggered state, appending the buffered frames to voiced frames and clearing the buffer.

If the function is already in a triggered state when we process a frame, then we append that frame to the voiced frames list regardless of whether it is speech or not. We then append it, and the truth value for whether it is speech or not, to the buffer. After appending to the buffer, if the buffer is more than 90% non-speech, then we change our state to not-triggered, yield voiced frames as bytes, and clear both the voiced frames list and the ring buffer. If, by the end of the frames, there are still frames in voiced frames, yield them as bytes.
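Putting that description into code, here is a sketch of vad_collector (the 90% thresholds match the text; `vad` is assumed to be a webrtcvad.Vad instance and `frames` an iterable of the Frame objects above):

```python
import collections


def vad_collector(sample_rate, frame_duration_ms, padding_duration_ms, vad, frames):
    """Yield byte strings of consecutive voiced audio, separated by silence."""
    num_padding_frames = int(padding_duration_ms / frame_duration_ms)
    ring_buffer = collections.deque(maxlen=num_padding_frames)
    triggered = False
    voiced_frames = []
    for frame in frames:
        is_speech = vad.is_speech(frame.bytes, sample_rate)
        if not triggered:
            ring_buffer.append((frame, is_speech))
            num_voiced = len([f for f, speech in ring_buffer if speech])
            if num_voiced > 0.9 * ring_buffer.maxlen:
                triggered = True
                voiced_frames.extend(f for f, _ in ring_buffer)
                ring_buffer.clear()
        else:
            voiced_frames.append(frame)
            ring_buffer.append((frame, is_speech))
            num_unvoiced = len([f for f, speech in ring_buffer if not speech])
            if num_unvoiced > 0.9 * ring_buffer.maxlen:
                triggered = False
                yield b"".join(f.bytes for f in voiced_frames)
                ring_buffer.clear()
                voiced_frames = []
    if voiced_frames:
        yield b"".join(f.bytes for f in voiced_frames)
```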

That's all we need to do to make sure that we can read in our wav file and use it to generate clips of PCM audio with voice activity detection. Now let's create a segment generator that returns not just the segment of byte data for the audio, but also the metadata needed to transcribe it. This function requires only one parameter, the '.wav' file. It is meant to filter out all the audio frames in which it does not detect voice, and return the parts of the audio file with voice. The function returns a tuple of the segments, the sample rate of the audio file, and the length of the audio file.

Now that we’ve handled the wav file and have created all the functions necessary to turn a wav file into segments of voiced PCM audio data that DeepSpeech can process, let’s create a way to load and resolve our models.

We'll create two functions called load_model and resolve_models . Intuitively, the load_model function loads a model, returning the DeepSpeech object, the model load time, and the scorer load time. It requires a model and a scorer, measures how long each takes to load using a Python timer, and creates a DeepSpeech 'Model' object from the 'model' parameter passed in.

The resolve_models function takes a directory name indicating which directory the models are in. It then grabs the first file ending in '.pbmm' and the first file ending in '.scorer' and loads them as the models.
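A sketch of these two helpers, assuming the DeepSpeech 0.9.x Python API (deepspeech.Model and enableExternalScorer) and hypothetical variable names:

```python
import glob
import os
from timeit import default_timer as timer

from deepspeech import Model


def load_model(model_path, scorer_path):
    """Return the DeepSpeech model plus model/scorer load times in seconds."""
    start = timer()
    ds = Model(model_path)
    model_load_time = timer() - start

    start = timer()
    ds.enableExternalScorer(scorer_path)
    scorer_load_time = timer() - start
    return ds, model_load_time, scorer_load_time


def resolve_models(dir_name):
    """Pick the first .pbmm and .scorer files found in dir_name."""
    pb = glob.glob(os.path.join(dir_name, "*.pbmm"))[0]
    scorer = glob.glob(os.path.join(dir_name, "*.scorer"))[0]
    return pb, scorer
```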

Being able to segment out the speech from our wav file, and load up our models, is all the preprocessing we need leading up to doing the actual Speech-to-Text conversion.

Let's now create a function that will allow us to transcribe our speech segments. This function will have three parameters: the DeepSpeech object (returned from load_model), the audio, and fs, the sampling rate of the audio. All it does, other than keep track of processing time, is call the DeepSpeech object's stt function on the audio.
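A sketch of that helper (the function name is hypothetical; DeepSpeech's stt expects 16-bit PCM as a NumPy int16 array):

```python
from timeit import default_timer as timer

import numpy as np


def segment_to_text(ds, audio_bytes, fs):
    """Run DeepSpeech inference on one voiced segment; return (text, seconds taken)."""
    # fs is unused here; older DeepSpeech APIs took it as a second argument to stt().
    start = timer()
    audio = np.frombuffer(audio_bytes, dtype=np.int16)
    text = ds.stt(audio)
    return text, timer() - start
```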


Alright, all our support functions are ready to go, let’s do the actual Speech-to-Text conversion.

In our “main” function below we’ll go ahead and directly provide a path to the models we downloaded and moved to the ‘./models’ directory of our working directory at the beginning of this tutorial.

We can ask the user for the level of aggressiveness for filtering out non-voice, or just automatically set it to 1 (from a scale of 0-3). We’ll also need to know where the audio file is located.

After that, all we have to do is use the functions we made earlier to load and resolve our models, load up the audio file, and run the Speech-to-Text inference on each segment of audio. The rest of the code below is just for debugging purposes to show you the filename, the duration of the file, how long it took to run inference on a segment, and the load times for the model and the scorer.

The function will save your transcript  to a ‘.txt’ file, as well as output the transcription in the terminal.

That’s it! That’s all we have to do to use DeepSpeech to do Speech Recognition on an audio file. That’s a surprisingly large amount of code. A while ago, I also wrote an article on how to do this in much less code with the AssemblyAI Speech-to-Text API. You can read about how to do Speech Recognition in Python in under 25 lines of code if you don’t want to go through all of this code to use DeepSpeech.

Basic DeepSpeech Real-Time Speech Recognition Example

Now that we’ve seen how we can do asynchronous Speech Recognition with DeepSpeech, let’s also build a real time Speech Recognition example. Just like before, we’ll start with installing the right requirements. Similar to the asynchronous example above, we’ll need webrtcvad, but we’ll also need pyaudio, halo, numpy, and scipy.

Halo provides an indicator that the program is streaming; numpy and scipy are used for resampling our audio to the right sampling rate.

How will we build a real time Speech Recognition program with DeepSpeech? Just as we did in the example above, we’ll need to separate out voice activity detected segments of audio from segments with no voice activity. If the audio frame has voice activity, then we’ll feed it into the DeepSpeech model to be transcribed.

Let's make a class for our voice-activity-detected audio frames; we'll call it VADAudio (voice activity detection audio). To start, we'll define the format, the rate, the number of channels, and the number of frames per second for our class.

Every class needs an __init__ function. The __init__ function for our VADAudio class, defined below, will take in four parameters: a callback, a device, an input rate, and a file. Everything but the input_rate will default to None if they are not passed at creation.

The input sampling rate will default to the processing sample rate we defined on the class above. When we initialize the class, we will also create an instance method called proxy_callback, which returns a tuple of None and the pyAudio signal to continue, but calls the callback function before it returns, hence the name proxy_callback.

Upon initialization, the first thing we do is set 'callback' to a function that puts the data into the buffer queue belonging to the object instance, and we initialize an empty queue for that buffer. We set the device and input rate to the values passed in, and the sample rate to the class's sample rate. Then, we derive our block size and block size input as the class's sample rate and the input rate, respectively, divided by the number of blocks per second. Blocks are the discrete segments of audio data that we will work with.

Next, we create a PyAudio object and declare a set of keyword arguments: format, set to the VADAudio class's format value we declared earlier; channels, set to the class's channel value; rate, set to the input rate; input, set to True; frames_per_buffer, set to the block size input calculated earlier; and stream_callback, set to the proxy_callback instance function we created earlier. We also set the aggressiveness of background-noise filtering to the aggressiveness passed in, which defaults to 3, the strongest filter, and we set the chunk size to None for now.

If a device is passed into the initialization of the object, we set an additional keyword argument, input_device_index, to the device. The device is the input device used, but what we actually pass through is the index of the device as defined by PyAudio; this is only necessary if you want to use an input device that is not your computer's default. If no device was passed in but a file object was, we change the chunk size to 320 and open the file to read it in as bytes.

Finally, we open and start a PyAudio stream with the keyword arguments dictionary we made.

Our VADAudio class will have six functions: resample, read_resampled, read, write_wav, a frame generator, and a voice-activity-detected segment collector. Let's start by making the resample function. Not all microphones support DeepSpeech's native 16 kHz processing sample rate, so this function takes in audio data and an input sample rate, and returns the data resampled to 16 kHz as bytes.
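A sketch of that resampling step, along the lines of DeepSpeech's streaming examples (written here as a standalone function with a default 16 kHz target rather than a class method):

```python
import numpy as np
from scipy import signal


def resample(data, input_rate, target_rate=16000):
    """Resample raw 16-bit PCM bytes from input_rate to target_rate and return bytes."""
    data16 = np.frombuffer(buffer=data, dtype=np.int16)
    resample_size = int(len(data16) / input_rate * target_rate)
    resampled = signal.resample(data16, resample_size)
    return np.array(resampled, dtype=np.int16).tobytes()
```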

Next, we’ll make the read and read_resampled functions together because they do basically the same thing. The read function “reads” the audio data, and the read_resampled function will read the resampled audio data. The read_resampled function will be used to read audio that wasn’t sampled at the right sampling rate initially.

The write_wav function takes a filename and data. It opens a file with the filename and allows writing of bytes with a sample width of 2 and a frame rate equal to the instance’s sample rate, and writes the data as the frames before closing the wave file.

Before we create our frame generator, we’ll set a property for the frame duration in milliseconds using the block size and sample rate of the instance.

Now, let’s create our frame generator. The frame generator will either yield the raw data from the microphone/file, or the resampled data using the read and read_resampled functions from the Audio class. If the input rate is equal to the default rate, then it will simply read in the raw data, else it will return the resampled data.

The final function we’ll need in our VADAudio is a way to collect our audio frames. This function takes a padding in milliseconds, a ratio that controls when the function “triggers” similar to the one in the basic async example above, and a set of frames that defaults to None.

The default value for padding_ms is 300, and the default for the ratio is 0.75. The padding is for padding the audio segments, and a ratio of 0.75 here means that if 75% of the audio in the buffer is speech, we will enter the triggered state. If there are no frames passed in, we’ll call the frame generator function we created earlier. We’ll define the number of padding frames as the padding in milliseconds divided by the frame duration in milliseconds that we derived earlier.

The ring buffer for this example will use a dequeue with a max length of the number of padding frames. We will start in a not triggered state. We will loop through each of the frames, returning if we hit a frame with a length of under 640. As long as the length of the frame is over 640, we check to see if the audio contains speech.

Now, we execute the same algorithm we did above for the basic example in order to collect audio frames that contain speech. While not triggered, we append speech frames to the ring buffer, triggering the state if the amount of speech frames to the total frames is above the threshold or ratio we passed in earlier.

Once triggered, we yield each frame in the buffer and clear the buffer. In a triggered state, we immediately yield the frame, and then append the frame to the ring buffer. We then check the ring buffer for the ratio of non-speech frames to speech frames and if that is over our predefined ratio, we untrigger, yield a None frame, and then clear the buffer.

Alright - we’ve finished creating all the functions for the audio class we’ll use to stream to our DeepSpeech model and get real time Speech-to-Text transcription. Now it’s time to create a main function that we’ll run to actually do our streaming transcription.

First we’ll give our main function the location of our model and scorer. Then we’ll create a VADAudio object with aggressiveness, device, rate, and file passed in.

Using the vad_collector function we created earlier, we get the frames and set up our spinner/indicator. We use the DeepSpeech model created from the model path passed in as an argument to create a stream. After initializing an empty byte array called wav_data , we go through each frame.

For each frame, if the frame is not None, we show a spinner spinning and then feed the audio content into our stream. If we’ve sent in the argument to save as a .wav file, then that file is also extended. If the frame is a None object, then we end the “utterance” and save the .wav file created, if we created one at all, and clear the byte array. Then we close the stream and open a new one.

Just like with the asynchronous Speech-to-Text transcription, the real-time transcription is an awful lot of code to do real time Speech Recognition. If you don’t want to manage all this code, you can check out our guide on how to do real time Speech Recognition in Python in much less code using the AssemblyAI Speech-to-Text API.

This ends part one of our DeepSpeech overview and tutorial. In this tutorial, we went over how to do basic Speech Recognition on a .wav file, and how to do Speech Recognition in real time, with DeepSpeech. Part two will be about training your own models with DeepSpeech, and how accurately it performs. It will be coming soon - so be on the lookout for that!

For more information, follow us @assemblyai and @yujian_tang on Twitter, and subscribe to our newsletter.


Speech Emotion Recognition: An Empirical Analysis of Machine Learning Algorithms Across Diverse Data Sets

  • Conference paper
  • First Online: 20 August 2024


Mostafiz Ahammed, Rubel Sheikh, Farah Hossain, Shahrima Mustak Liza, Muhammad Arifur Rahman, Mufti Mahmud & David J. Brown

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 2065))

Included in the following conference series:

  • International Conference on Applied Intelligence and Informatics


Communication is the way of expressing one’s feelings, ideas, and thoughts. Speech is a primary medium for communication. While people communicate with each other in several human interactive applications, such as a call center, entertainment, E-learning between teachers and students, medicine, and communication between clinicians and patients (especially important in the field of psychiatry), it is crucial to identify people’s emotions to better understand what they are feeling and how they might react in a range of situations. Automated systems are constructed to recognise emotions from analysis of speech or human voice using Artificial Intelligence (AI) or Machine Learning (ML) approaches, and these approaches are gaining momentum in recent research. This research aims to recognise a range of emotional states such as happy, sad, calm, angry, fear, disgust, surprise, or neutral from input speech signals with greater accuracy than currently seen in contemporary research. In order to achieve this aim, we have used the Support Vector Machine (SVM) classification algorithm and formed a feature vector by exploring speech features such as Mel Frequency Cepstral Coefficient (MFCC), Chroma, Mel-spectrogram, Spectral Centroid, Spectral Bandwidth, Spectral Roll-off, Root Mean Squared Energy (RMSE), and Zero Crossing Rate (ZCR) from speech signals. The system is tested on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Toronto Emotional Speech Set (TESS), and the Surrey Audio-Visual Expressed Emotion Database (SAVEE) datasets. Our proposed approach has achieved an overall accuracy of 99.59% on the RAVDESS dataset, 99.82% on the TESS dataset, and 98.95% on the SAVEE dataset for the SVM classifier. A mixed dataset is created from the three speech emotion datasets, which achieved significantly high classification accuracy compared with state-of-the-art methods. This model performs well on a large dataset, is ready to be tested with even bigger datasets, and can be used in a range of diverse applications, including education and clinical applications. GitHub: https://github.com/Mostafiz24/Speech-Emotion-Recognition.
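For orientation only, a rough sketch of the kind of feature vector the abstract describes, using librosa and scikit-learn (these libraries and the exact parameters are assumptions; the paper's actual pipeline is not reproduced here):

```python
import numpy as np
import librosa
from sklearn.svm import SVC


def extract_features(path):
    """Mean-pool a set of spectral features similar to those listed in the abstract."""
    y, sr = librosa.load(path, sr=None)
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.melspectrogram(y=y, sr=sr),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.rms(y=y),
        librosa.feature.zero_crossing_rate(y),
    ]
    return np.hstack([f.mean(axis=1) for f in feats])


# Hypothetical usage (wav_paths and emotion_labels are assumed to be prepared elsewhere):
# X = np.vstack([extract_features(p) for p in wav_paths])
# clf = SVC(kernel="rbf").fit(X, emotion_labels)
```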





Frequently Asked Questions

Background

Speech Recognition Engines need Acoustic Models trained with speech audio that has the same sampling rate and bits per sample as the speech they will recognize. The different speech media have limitations that affect speech recognition.

Telephony Bandwidth Limitations 

For example, for telephony speech recognition, the limitation is the 64kbps bandwidth of a telephone line.  This only permits a sampling rate of 8kHz and a sampling resolution of 8-bits per sample. Therefore, to perform speech recognition on a telephone line, you need Acoustic Models trained using audio recorded at an 8kHz sampling rate with 8-bits per sample.  VoIP applications usually have the same limitations since they allow interconnection to Public Service Telephone Network (PSTN).

Desktop Sound Card and Processor Limitations 

For desktop Command and Control applications,  your PC's sound card determines your maximum sampling rate and bits per sample, and the power of your CPU determines what kinds of acoustic models your Speech Recognition Engine can process efficiently.

So why record at the highest sampling rate and bits per sample?

Speech Recognition Engines work best with Acoustic Models trained with audio recorded at a higher sampling rate and bits per sample. However, since current hardware (CPUs and/or sound cards) is not powerful enough to support Acoustic Models trained at higher sampling rates and bits per sample, and telephony applications have bandwidth limitations (as discussed above), a compromise is required. VoxForge has decided that the best approach (for now) is to collect speech recorded at the highest sampling rate your audio card supports, at 16 bits per sample, and then downsample the audio to sampling rates that can be supported by the speech medium.

For example, for Command and Control applications on a desktop PC, you can downsample the 48 kHz/16-bit audio to 16 kHz/16-bit audio and create Acoustic Models from that (a downsampling sketch is shown below). This approach keeps us backward compatible with older sound cards that may not support the higher sampling rates or bit depths, and it also lets us look to the future: audio submitted at higher sampling rates and bit depths will remain usable as sound cards that support them become more common and processing power increases.
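As an illustration of that 48 kHz to 16 kHz step, here is a sketch using the soundfile and SciPy libraries (which VoxForge does not itself prescribe; file names are placeholders and a mono recording is assumed):

```python
import soundfile as sf
from scipy.signal import resample_poly

# Downsample a 48 kHz/16-bit submission to 16 kHz/16-bit for acoustic model training.
audio, rate = sf.read("submission_48k.wav", dtype="int16")   # assumes a mono file
assert rate == 48000
downsampled = resample_poly(audio.astype("float32"), up=1, down=3)  # 48 kHz / 3 = 16 kHz
sf.write("submission_16k.wav", downsampled.astype("int16"), 16000, subtype="PCM_16")
```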

For Telephony applications, to create Acoustic Models from audio recorded at a sample rate of 48kHz with 16-bits per sample, you must first downsample the audio to a sample rate of 8kHz/8-bit per sample, and then create an Acoustic Model from this.

Some VoIP PBXs, such as Asterisk, actually represent audio data internally at 8 kHz/16-bit sampling rates, even though the codec used might only support 8 kHz/8-bit sampling rates. Therefore, VoIP PBXs like Asterisk can use Acoustic Models trained on audio with 8 kHz/16-bit sampling rates.


Does a speaker recognition system need audio with the same sample rate?

I'm new to speaker recognition. I've trained a model with some .wav files whose sample rate is 16000 Hz.

I want to test this model with my own recording, but my recording's sample rate is 8000 Hz.

Does a speaker recognition system need the same sample rate for training and testing?

I hope to have your answers.



  • Can I please ask what sort of features does your model use? – A_A (Feb 28, 2018)
  • @A_A It's MFCCs – Can Nguyen (Feb 28, 2018)
Does it need the same sample rate for training and testing a speaker recognition system?

I suppose that you are asking whether or not the difference in sampling frequency ($Fs$) will affect your classification result.

The short answer is "It depends on the data". The slightly longer answer is:

The $Fs$ has a direct impact on the captured bandwidth. At 8kHz, you are capturing (nominally) 0-4kHz. At 16kHz, you are capturing (nominally) 0-8kHz.

Speaker recognition is essentially trying to discriminate between the spectral differences in the sound of different speakers.

Therefore, a trivial example is one where if your $Fs$ is 8kHz and the key variation of spectral differences occurs between 6-7kHz, then your classifier will completely fail to discriminate between its classes.

Another detail is the case where the $Fs$ is suitable, i.e. it includes the frequency range of interest, but the recordings are awful and some other sound source is masking the range of frequencies the classifier depends on.

Therefore, "It depends on the data".

But, if we assume that the recordings are not problematic and a lower $Fs$ does indeed contain the frequency band of interest, then what might be worth considering more is the size of the transforms that are involved in the derivation of the MFCC that your classifier depends on. In other words, lowering the $Fs$ might require that you increase the length of the transforms to maintain the same (or similar) frequency resolution. Otherwise, adjacent frequency bins might get lumped together and reduce the clarity of the observed spectrum.

Finally, a more practical note would be to make sure that any existing code you might be using does not contain hard coded $Fs$. If the original author has assumed a fixed $Fs$ throughout the implementation of the model, it is likely that the classifier will have unpredictable behaviour if you feed it with sound files of different characteristics.
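(As a practical illustration of that point, added editorially rather than taken from the original answer: one common way to avoid a hard-coded-$Fs$ mismatch is to resample the 8 kHz test recordings to the 16 kHz rate the model was trained on before extracting MFCCs, for example with librosa, which resamples on load. File names and parameters below are placeholders.)

```python
import librosa

# Hypothetical file name: resample an 8 kHz test recording to the 16 kHz
# rate the model was trained on before computing MFCCs.
y, sr = librosa.load("test_recording_8k.wav", sr=16000)   # librosa resamples to sr
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
```

Note that upsampling cannot restore content above 4 kHz, which is exactly the caveat raised above.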

Hope this helps.


  • Thanks so much for your reply. It helps me a lot. For clarification, if I make sure I'm doing it the right way and avoid the problems you mentioned, is it OK to use an 8 kHz audio file to test a model created with 16 kHz audio? – Can Nguyen (Feb 28, 2018)
  • @CanNguyen You are asking me the same question as your original question, and I can only refer you back to the same answer. Yes, it will be all right provided that you take care of the details mentioned in the answer. – A_A (Feb 28, 2018)


Towards measuring fairness in speech recognition: Fair-Speech dataset

Irina-Elena Veliche, Zhuangqun Huang, Vineeth Ayyat Kochaniyan, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer

The current public datasets for speech recognition (ASR) tend not to focus specifically on the fairness aspect, such as performance across different demographic groups. This paper introduces a novel dataset, Fair-Speech, a publicly released corpus to help researchers evaluate their ASR models for accuracy across a diverse set of self-reported demographic information, such as age, gender, ethnicity, geographic variation and whether the participants consider themselves native English speakers. Our dataset includes approximately 26.5K utterances in recorded speech by 593 people in the United States, who were paid to record and submit audios of themselves saying voice commands. We also provide ASR baselines, including on models trained on transcribed and untranscribed social media videos and open source models.

1 Introduction

The performance of current speech recognition (ASR) systems has improved significantly over the last few years, with the emergence of new modeling techniques and considerable amounts of training data. However, most of the improvements target overall word error rate (WER). The evaluation sets being used tend to lack information associated with demographic characteristics, such as ethnicity, geographic variation and whether utterances come from native or non-native English speakers. Also, most numbers are reported in aggregate, without giving a clearer picture of the potential gaps between different demographic groups. While there have been many studies showing that ASR systems do not perform equally well for all demographic and accent groups [ 1 , 2 , 3 , 4 , 5 , 6 ] , the number of open sourced datasets that can be used for evaluation of such characteristics is limited.

In this paper we introduce a new ASR dataset, Fair-Speech. Our dataset includes approximately 26.5K utterances in recorded speech by 593 people in the U.S. who were paid to record and submit audio of themselves saying commands. They self-identified their demographic information, such as age, gender, ethnicity, geographic location and whether they consider themselves native English speakers, together with their first language.

The verbal commands included in this dataset are categorized into seven domains, primarily serving voice assistant use cases (music, capture, utilities, notification control, messaging, calling, and dictation), which can support researchers who are building or already have models in those areas. In response to prompts that relate to each of these domains, dataset participants provided their own audio commands. Some examples of prompts were asking how they would search for a song or make plans with friends, including deciding where to meet. Providing broad prompts to guide the speakers is better than simply asking participants to read text prompts, since reading tends to make the audio sound less natural: people make different kinds of pauses than in natural speech, and entities might not be pronounced properly if the participants are not familiar with them. Our dataset includes the audio and transcription of participants’ utterances, together with their self-identified labels across the different demographic categories. The intent is for this to be used for evaluating the performance of existing ASR models. The data user agreement prevents a user from developing models that predict the value of those labels, but one may measure the performance of different models as a function of those labels.

By releasing this dataset, we hope to further motivate the AI community to continue improving the fairness of speech recognition models, which will help everyone have a better experience using applications with ASR.

2 Previous work on ASR Fairness

As voice recognition systems have become more integrated into daily lives, especially through the use of voice assistants, there has been a considerable amount of research showing that those systems exhibit biases in the performance of the ASR models. For example, [ 4 ] studied the ability of different ASR systems to transcribe structured interviews of black and white speakers, finding that all of them exhibited substantial racial disparities. The study was done on the Corpus of Regional African American Language [ 7 ] , a collection of socio-linguistic interviews with dozens of black individuals who speak African American Vernacular English, and also on Voices of California [ 8 ] , which is a compilation of interviews recorded in both rural and urban areas of California. Prior studies also showed disparities across accents and socio-economic status of the speakers [ 5 ] , race and gender bias [ 9 , 10 , 11 , 12 ] , and bias against regional and non-native accents [ 6 ] .

There is also some recent work on how some of these demographic biases can be mitigated, such as in [ 3 , 13 , 14 , 15 ] . However, some are using in-house datasets, which are difficult to use for comparison.

There are a number of open sourced datasets that can be used to measure the fairness of ASR systems across different demographic groups. Apart from the two mentioned above [ 7 , 8 ] , there is also the Artie Bias corpus [ 16 ] , a curated set of the Mozilla Common Voice corpus, which contains demographic tags for age, gender, and accent. The Casual Conversations dataset [ 17 ] has associated tags for gender, age and skin tone, while the ICSI meeting corpus [ 18 ] has associated information on gender, age, native language and education level.

The Fair-Speech dataset aims to provide data recorded in a free speech manner from a more diverse set of speakers, where participants self-identified across the different demographic categories.

3 Corpus contents

The verbal commands included in this dataset are categorized into seven domains, primarily serving voice assistant use cases — music, capture, utilities, notification control, messaging, calling, and dictation. In response to prompts that relate to each of these domains, dataset participants provided their own audio commands. Our dataset includes the audio and transcription of participants’ utterances. The audio was collected on mobile devices. The intention is for this dataset to be used as an evaluation tool, to uncover gaps or biases in ASR models.

This dataset was constructed from the recordings of paid participants who explicitly consented to their recordings, together with the associated demographic information, being used for research. This ensures that the dataset aligns with ethical standards for data collection, respects the privacy and autonomy of the participants, and promotes transparency and other key considerations of responsible data collection practices.

Table 1 shows the per-category distribution of the entire dataset, in terms of the number of unique speakers and the number of utterances for each demographic sub-group. For age we have fairly good representation across the 18 - 65 groups, with a larger percentage of utterances in the 31 - 45 bucket. The 66+ bucket had too few speakers and utterances, so we chose not to include it here. The gender distribution of utterances is more balanced. Since we didn’t have a significant number of utterances from people who identified as non-binary, we chose not to include them, to avoid showing skewed results. In terms of ethnicity, we have fairly good representation across multiple categories. The two categories with less representation are Native Hawaiian or other Pacific Islander, with 3.6% of total entries, and Middle Eastern or North African, with 2.4%; they also have fewer speakers than the other categories. In terms of geographic variation, a bit more than half of the utterances are from people who earn less than $50k per year, and only 7% are from people who earn more than $100k, with only 50 speakers in that sub-group. In terms of linguistic variation, we split the utterances based on whether the speaker’s native language is English or not. About 80% of the data comes from people whose native language is English. For the other bucket, we provide the first language in the dataset as well. After English, most utterances are from Spanish and Mandarin speakers, with other languages represented in smaller percentages.

Table 1. Number of unique speakers and utterances per demographic sub-group (columns: speakers, utterances; the sub-group label of each row did not survive extraction).

Age:
  84    3846
  95    4168
  285   12770
  129   5687
Gender:
  321   14422
  272   12049
Ethnicity:
  82    3854
  180   7807
  63    2814
  17    749
  105   4632
  22    969
  124   5646
Geographic variation:
  335   14780
  217   9824
  41    1867
Linguistic variation:
  482   21528
  111   4943

3.1 Breakdown by gender for demographic categories

In Figure 1 we show the breakdown by gender for each of the demographic categories. Some sub-groups, such as 18 - 22, are well balanced between genders, while others are skewed: the Black or African American sub-group has many more utterances from male than from female speakers, while the Asian, South Asian or Asian-American sub-group is skewed the other way. This is important to take into account when analyzing results. Therefore, in Section 5.1 we take confounding factors and speaker variability into account when reporting WER gaps between different sub-groups.

[Figure 1: Breakdown by gender for each demographic category.]

3.2 Data transcription

To produce transcriptions useful for ASR evaluation, all data was transcribed verbatim. Colloquial words were kept as spoken, as were repeated words. Numbers were spelled out as spoken, and entities were transcribed with capital letters. All other text is lower-cased, without punctuation. Since this is a voice command dataset, each audio contains utterances from a single speaker. The length of the utterances varies: each recording lasts 7.36 seconds on average, and the maximum length of an utterance is around one minute.
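As a rough illustration of these conventions, a scoring-side normalization helper might look like the sketch below; the paper does not publish its normalization code, so the function name and rules here are assumptions.

```python
import re

def normalize_for_wer(text: str) -> str:
    """Lower-case and strip punctuation so ASR hypotheses roughly match the
    transcription conventions described above (verbatim, lower-cased, no
    punctuation). This is an illustrative assumption, not the paper's code."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)   # drop punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()
```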

High-quality annotations were achieved through multi-pass human transcription, resolution, curation, and scoring. The multi-pass annotation by external vendors ran until three-way agreement was reached. The ~10% of rows without agreement were resolved by internal linguistic engineers. Data curation started from a side-by-side audit comparing annotations with ASR hypotheses coming from models of different sizes. The ~650 rows with low confidence were further audited by culture-specific linguistic vendors. All audited data was reviewed by multiple internal teams for cross-functional validation. Finally, 200 random rows were scored by an internal linguistic expert to obtain the annotation WER (1.47%).

4 Speech recognition experiments

To provide some baselines on the Fair-Speech dataset, we trained a series of recurrent neural network transducer (RNN-T) [ 19 ] models and also evaluated open-sourced models.

1. Video model, supervised data only: an RNN-T model with an Emformer encoder [ 20 ] , an LSTM predictor and a joiner, with approximately 50 million parameters in total. The input feature stride is 6. The encoder network has 13 Emformer layers, each with an embedding dimension of 480, 4 attention heads, and an FFN size of 2048. The prediction network is an LSTM layer with 256 hidden units and dropout 0.3. The joint network has 1024 hidden units and a softmax layer of 4096 units for blank and wordpieces. The model was trained on 29.8K hours of English video data that was completely de-identified before transcription. It contains a diverse range of speakers, accents, topics and acoustic conditions. We apply distortion and additive noise to the speed-perturbed data, resulting in a total of 148.9K hours of training data.

2. Video model, semi-supervised: a streaming Emformer model trained on over 2 million hours of social media videos. 29.8K hours are manually transcribed as described above, and the rest is unlabeled data decoded by larger teacher models. The model has approximately 290M parameters.

3. Whisper [ 21 ] : a transformer-based model, trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio. We evaluate both the large-v2 model, which has 1550M parameters, and the small model, which has 244M parameters.
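As a point of reference for the numbers reported below, a minimal sketch of the per-sub-group WER and relative-gap computation could look as follows, using the open-source jiwer package; the DataFrame column names ("reference", "hypothesis", "subgroup") are hypothetical and depend on how the dataset and model outputs are loaded.

```python
import jiwer
import pandas as pd

def wer_by_subgroup(df: pd.DataFrame) -> pd.Series:
    """WER for each demographic sub-group in a table of scored utterances."""
    return df.groupby("subgroup").apply(
        lambda g: jiwer.wer(list(g["reference"]), list(g["hypothesis"]))
    )

def relative_gap(wers: pd.Series) -> float:
    """One plausible definition of the relative gap between the highest- and
    lowest-WER sub-groups, consistent with the values in Table 2."""
    return (wers.max() - wers.min()) / wers.max()
```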

We computed WER on the Fair-Speech dataset using the four models described above. We also noted the relative gap between the groups with the lowest and highest WER in each category. As can be seen in Table 2, there are gaps across all demographic groups. Some of the key observations:

The data used in training can have a significant impact on WER, particularly when it comes to bias between different demographic groups. For all evaluated models, the relative gap is in double digits for most demographic categories, and for the Whisper models the gap is larger than 40% across all dimensions.

Adding more data in training can significantly improve the performance of the model, as when semi-supervised data was added for the video models. Interestingly, however, for geographic variation the relative gap becomes wider even though the WER improves with more data. The linguistic-variation gap almost disappears when more data is added in training, making the training set more diverse. For ethnicity, while the supervised model has some large gaps between groups, adding semi-supervised data shrinks the gaps and changes which group has the highest WER.

As expected, a larger model improves accuracy across all categories, as can be seen with the two Whisper models. However, the relative gap for linguistic variation actually increases for the larger model.

Data might not be enough to achieve a fair model. All the models shown here were trained on more than 1 million hours of data. However, they exhibit significant gaps across each of the demographic sub-groups. Thus, new modeling techniques are needed to focus on improving the performance for all people. Also, during evaluation, a particular focus needs to be given to demographic breakdowns in addition to overall model accuracy.

Table 2. WER per demographic sub-group and relative gap between the sub-groups with the lowest and highest WER in each category. Columns, left to right: video model (supervised), video model (semi-supervised), Whisper small, Whisper large-v2. The sub-group label of each row did not survive extraction.

Age:
  6.52   3.79   5.63   4.46
  7.93   4.13   6.74   4.62
  11.46  5.16   12.47  7.48
  6.94   4.62   5.05   3.65
  Relative gap: 43.1%  26.55%  59.5%  51.2%
Gender:
  6.76   3.82   5.16   3.86
  12.06  5.75   13.3   7.91
  Relative gap: 43.94%  33.56%  61.2%  51.2%
Ethnicity:
  6.75   4.21   4.93   3.7
  14.21  4.9    16.99  9.52
  7.68   5.09   5.84   3.9
  8.13   3.67   7.99   5.08
  7.15   4.13   5.3    4.12
  6.47   3.51   5.99   3.84
  6.29   4.03   4.51   3.96
  Relative gap: 55.73%  31.04%  73.45%  61.13%
Geographic variation:
  8.67   4.95   7.69   5.4
  10.13  4.54   10.94  6.38
  6.99   3.07   5.55   3.62
  Relative gap: 30.99%  37.97%  49.26%  43.26%
Linguistic variation:
  9.51   4.66   9.54   6.11
  7.49   4.71   5.63   3.78
  Relative gap: 21.24%  1.06%  40.98%  46.83%

There can be many nuances when interpreting these WER gaps, due to speaker variability, sample sizes and confounding factors. Thus, we use a model-based approach to measure fairness that takes all these factors into account and provides a more accurate picture of the statistical significance of the results.

5.1 Understanding the WER gaps

When analyzing the results, we employed a model-based approach to measure fairness, using mixed-effects Poisson regression to interpret WER differences between subgroups of interest, as described in [ 22 ] . This takes nuisance factors and unobserved heterogeneity across speakers into account and helps trace the source of WER gaps between different subgroups. For this analysis, we used the video semi-supervised model.

We apply the model-based approach, where we fit a mixed-effects Poisson regression with the demographic group we focus on (age, gender etc.) as the fixed effect and speaker label as a random effect.

When computing the fairness measurement of speech recognition accuracy among different subgroups of a factor $f(\cdot)$, the model is described as follows in [ 22 ] :

$$y_{ij} \mid r_i \sim \mathrm{Poisson}(\lambda_{ij}) \tag{1}$$

$$\log \lambda_{ij} = \log N_{ij} + \mu_{f(i)} + r_i \tag{2}$$

$$r_i \sim \mathcal{N}(0, \sigma^{2}) \tag{3}$$

where the utterance-level subscript $ij$ denotes the $j$-th utterance of the $i$-th speaker, $y_{ij}$ is the number of recognition errors in that utterance and $N_{ij}$ its number of reference words, $r_i$ denotes the speaker-level random effect, sampled independently from a Gaussian distribution with mean $0$ and learnable variance $\sigma^{2}$, and $\mu_{f(i)}$ denotes the fixed effect for the factor $f(\cdot)$ of primary interest, which is typically defined at the speaker level.
The bootstrap method [ 23 ] is applied to compute the 95% confidence interval (CI) of the ratio. If the CI does not include the value of one, we consider the effect statistically significant.
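A minimal sketch of such a speaker-level bootstrap is given below; the column names and the choice to resample speakers with replacement are assumptions about how the per-utterance results are stored and grouped.

```python
import numpy as np
import pandas as pd

def corpus_wer(df: pd.DataFrame) -> float:
    return df["n_errors"].sum() / df["n_ref_words"].sum()

def bootstrap_wer_ratio_ci(df_a, df_b, n_boot=1000, seed=0):
    """95% CI of WER(A) / WER(B), resampling speakers with replacement."""
    rng = np.random.default_rng(seed)
    speakers_a = df_a["speaker"].unique()
    speakers_b = df_b["speaker"].unique()
    ratios = []
    for _ in range(n_boot):
        sample_a = pd.concat([df_a[df_a["speaker"] == s]
                              for s in rng.choice(speakers_a, size=len(speakers_a))])
        sample_b = pd.concat([df_b[df_b["speaker"] == s]
                              for s in rng.choice(speakers_b, size=len(speakers_b))])
        ratios.append(corpus_wer(sample_a) / corpus_wer(sample_b))
    return tuple(np.percentile(ratios, [2.5, 97.5]))
```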

Results are shown in Table 3. For each demographic category, we do a pairwise comparison across all subgroups. Rows whose confidence interval excludes one indicate a statistically significant WER difference between the two subgroups; the remaining differences are not significant.

For age, when we compare the 18 - 22 group, which has the lowest WER, with the other groups, we see statistically significant differences to groups 31 - 45 and 46 - 65, but not to group 23 - 30.

For gender, there is a statistically significant difference between Female and Male speakers, which is in line with sociolinguistic findings that women tend to speak in a more standard way than men [ 24 ] .

For ethnicity, when we do the pairwise comparisons, there are statistically significant differences between the Black or African American subgroup and all the other subgroups. This is in line with previous research on racial disparities. [ 12 ] found that this is due to phonological, phonetic or prosodic characteristics of African American Vernacular English, rather than the grammatical or lexical characteristics.

In terms of geographic variation, there are statistically significant differences between the Low and Medium sub-groups and between the Medium and Affluent sub-groups.

For linguistic variation, even though the difference in WER is quite small between the two sub-groups, we see statistically significant differences between people whose first language is English and those who have a different first language.

Table 3. Pairwise WER ratios between sub-groups within each demographic category, with bootstrap 95% confidence intervals (video semi-supervised model). The sub-group pair label of each row did not survive extraction; a ratio whose CI excludes one indicates a statistically significant difference.

Age (pairwise comparisons):
  1.18 (0.98, 1.41)
  1.43 (1.22, 1.68)
  1.31 (1.06, 1.6)
  1.21 (1.05, 1.4)
  1.1 (0.92, 1.32)
  0.91 (0.8, 1.02)
Gender:
  1.39 (1.26, 1.53)
Ethnicity (pairwise comparisons):
  1.81 (1.55, 2.11)
  0.97 (0.76, 1.23)
  1.04 (0.76, 1.43)
  1.35 (1.1, 1.66)
  1.16 (0.87, 1.54)
  0.92 (0.76, 1.12)
  0.54 (0.47, 0.62)
  0.59 (0.49, 0.7)
  0.74 (0.66, 0.83)
  0.63 (0.54, 0.73)
  0.51 (0.45, 0.57)
  1.07 (0.76, 1.5)
  1.38 (1.1, 1.74)
  1.19 (0.88, 1.59)
  0.94 (0.76, 1.16)
  1.28 (0.88, 1.88)
  1.08 (0.75, 1.56)
  0.88 (0.61, 1.25)
  0.85 (0.67, 1.08)
  0.68 (0.58, 0.8)
  0.8 (0.59, 1.08)
Geographic variation:
  1.2 (1.08, 1.33)
  1.14 (0.88, 1.48)
  0.72 (0.62, 0.83)
Linguistic variation:
  0.82 (0.71, 0.94)

6 Conclusion

In this paper, we introduced a new ASR dataset, Fair-Speech, which has metadata attached for different demographic groups (age, gender, ethnicity, geographic and linguistic variation) and can be used for fairness evaluation when developing speech recognition models. We also ran baseline analyses on different models and found statistically significant gaps across the different sub-groups for each demographic category. The datasets, with transcripts and metadata, are all released to the external community. We hope that they will help in evaluating and improving the fairness of speech recognition models.

  • [1] R. Tatman, “Gender and dialect bias in youtube’s automatic captions,” in EthNLP@EACL , 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:13997424
  • [2] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, “A survey on bias and fairness in machine learning,” ACM Comput. Surv. , vol. 54, no. 6.
  • [3] P. Dheram, M. Ramakrishnan, A. Raju, I.-F. CHEN, B. King, K. Powell, M. Saboowala, K. Shetty, and A. Stolcke, “Toward fairness in speech recognition: Discovery and mitigation of performance disparities,” in Interspeech 2022 , 2022.
  • [4] A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. Rickford, D. Jurafsky, and S. Goel, “Racial disparities in automated speech recognition,” Proceedings of the National Academy of Sciences , vol. 117, p. 201915768, 03 2020.
  • [5] M. Riviere, J. Copet, and G. Synnaeve, “Asr4real: An extended benchmark for speech models,” 10 2021.
  • [6] S. Feng, O. Kudina, B. Halpern, and O. Scharenborg, “Quantifying bias in automatic speech recognition,” 03 2021.
  • [7] T. Kendall and C. Farrington, “The corpus of regional african american language,” Version , vol. 6, p. 1, 2018.
  • [8] “Stanford linguistics, voices of california,” http://web.stanford.edu/dept/linguistics/VoCal/.
  • [9] J. P. Bajorek, “Voice recognition still has significant race and gender biases,” Harvard Business Review, May 2019. [Online]. Available: https://hbr.org/2019/05/voice-recognition-still-has-significant-race-and-gender-biases
  • [10] Z. Mengesha, C. Heldreth, M. Lahav, J. Sublewski, and E. Tuennerman, “"I don’t think these devices are very culturally sensitive." - the impact of errors on African Americans in automated speech recognition,” 2021.
  • [11] R. Tatman and C. Kasten, “Effects of talker dialect, gender & race on accuracy of bing speech and youtube automatic captions,” in Interspeech , 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:38523141
  • [12] J. L. Martin and K. Tang, “Understanding Racial Disparities in Automatic Speech Recognition: The Case of Habitual “be”,” in Proc. Interspeech 2020 , 2020, pp. 626–630.
  • [13] I.-E. Veliche and P. Fung, “Improving fairness and robustness in end-to-end speech recognition through unsupervised clustering,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) .   IEEE, 2023, pp. 1–5.
  • [14] L. Sarı, M. Hasegawa-Johnson, and C. D. Yoo, “Counterfactually fair automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing , vol. 29, pp. 3515–3525, 2021.
  • [15] M. Kusner, J. Loftus, C. Russell, and R. Silva, “Counterfactual fairness,” in Proceedings of the 31st International Conference on Neural Information Processing Systems , ser. NIPS’17.   Red Hook, NY, USA: Curran Associates Inc., 2017, p. 4069–4079.
  • [16] J. Meyer, L. Rauchenstein, J. D. Eisenberg, and N. Howell, “Artie bias corpus: An open dataset for detecting demographic bias in speech applications,” in Proceedings of the Twelfth Language Resources and Evaluation Conference .   Marseille, France: European Language Resources Association, May 2020, pp. 6462–6468. [Online]. Available: https://aclanthology.org/2020.lrec-1.796
  • [17] C. Liu, M. Picheny, L. Sarı, P. Chitkara, A. Xiao, X. Zhang, M. Chou, A. Alvarado, C. Hazirbas, and Y. Saraf, “Towards measuring fairness in speech recognition: Casual conversations dataset transcriptions,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 2022, pp. 6162–6166.
  • [18] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, “The icsi meeting corpus,” in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03). , vol. 1, 2003, pp. I–I.
  • [19] A. Graves, “Sequence transduction with recurrent neural networks,” ArXiv , vol. abs/1211.3711, 2012. [Online]. Available: https://api.semanticscholar.org/CorpusID:17194112
  • [20] Y. Shi, Y. Wang, C. Wu, C.-F. Yeh, J. Chan, F. Zhang, D. Le, and M. Seltzer, “Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 2021, pp. 6783–6787.
  • [21] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” 2022. [Online]. Available: https://arxiv.org/abs/2212.04356
  • [22] Z. Liu, I.-E. Veliche, and F. Peng, “Model-based approach for measuring the fairness in asr,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 2022, pp. 6532–6536.
  • [23] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. CRC Press, 1994.
  • [24] S. Goldwater, D. Jurafsky, and C. D. Manning, “Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates,” Speech Communication, vol. 52, no. 3, pp. 181–200, 2010.


Speech recognition and synthesis sample


Shows how to use Speech Recognition and Speech Synthesis (Text-to-speech) in UWP apps.

Note: This sample is part of a large collection of UWP feature samples. You can download this sample as a standalone ZIP file from docs.microsoft.com , or you can download the entire collection as a single ZIP file , but be sure to unzip everything to access shared dependencies. For more info on working with the ZIP file, the samples collection, and GitHub, see Get the UWP samples from GitHub . For more samples, see the Samples portal on the Windows Dev Center.

Specifically, this sample covers the following scenarios:

  • Synthesizing text to speech (TTS)
  • Synthesizing Speech Synthesis Markup Language (SSML)
  • One-shot recognition using the predefined dictation grammar
  • One-shot recognition using the predefined web search grammar
  • One-shot recognition using a custom list-based grammar
  • One-shot recognition using a custom SRGS/GRXML grammar
  • Continuous dictation
  • Continuous recognition using a custom list-based grammar
  • Continuous recognition using a custom SRGS/GRXML grammar
  • Pausing and resuming continuous recognition

In addition, translations are shown for speech recognition and text-to-speech for supported languages. The translations provided may not use ideal phrasing and are included for demonstration purposes only.

Scenarios 3, 4, and 7 require internet connectivity because they use the SpeechRecognitionTopicConstraint class, which uses the predefined grammars provided by a web service.

Privacy Policy

Web service-based speech recognition features require acceptance of the Microsoft Privacy Policy. Information about this privacy policy can be found in the Settings app, under Privacy -> Speech, Inking and Typing. You must view the privacy policy in order to accept it. To view the privacy policy, press the Privacy Policy link on the Speech, Inking and Typing settings page.

You can disable functionality that requires accepting this policy by turning off "Getting to know you" under Settings -> Privacy -> Speech, Inking and Typing. The samples will indicate to you if the privacy policy has not been accepted where necessary.

Related topics

Speech recognition Speech synthesis Speech design guidelines Speech interactions Responding to speech interactions (HTML)

Related samples

  • Family Notes sample
  • SpeechRecognitionAndSynthesis sample for JavaScript (archived)

System requirements

  • Speech recognition requires an appropriate audio input device.

Build the sample

  • If you download the samples ZIP, be sure to unzip the entire archive, not just the folder with the sample you want to build.
  • Start Microsoft Visual Studio and select File > Open > Project/Solution .
  • Starting in the folder where you unzipped the samples, go to the Samples subfolder, then the subfolder for this specific sample, then the subfolder for your preferred language (C++, C#, or JavaScript). Double-click the Visual Studio Solution (.sln) file.
  • Press Ctrl+Shift+B, or select Build > Build Solution .

Run the sample

The next steps depend on whether you just want to deploy the sample or you want to both deploy and run it.

Deploying the sample

  • Select Build > Deploy Solution.

Deploying and running the sample

  • To debug the sample and then run it, press F5 or select Debug > Start Debugging. To run the sample without debugging, press Ctrl+F5 or select Debug > Start Without Debugging.

Known Issues

  • The sample requires Media Player components to be available. If Media Player has been uninstalled, or when using an 'N' SKU of Windows without Media Player components, the sample will not function. Note, however, that Speech Synthesis and Speech Recognition do not require Media Player directly, but other parts of the sample do (such as playing back synthesized text, or checking whether a microphone is present and the app has permission to use it). Developers should make sure their app is aware of this and handles it gracefully.


library-reference.rst

Speech Recognition Library Reference

Microphone(device_index: Union[int, None] = None, sample_rate: int = 16000, chunk_size: int = 1024) -> Microphone

Creates a new Microphone instance, which represents a physical microphone on the computer. Subclass of AudioSource .

This will throw an AttributeError if you don't have PyAudio 0.2.11 or later installed.

If device_index is unspecified or None , the default microphone is used as the audio source. Otherwise, device_index should be the index of the device to use for audio input.

A device index is an integer between 0 and pyaudio.get_device_count() - 1 (assume we have used import pyaudio beforehand) inclusive. It represents an audio device such as a microphone or speaker. See the PyAudio documentation for more details.

The microphone audio is recorded in chunks of chunk_size samples, at a rate of sample_rate samples per second (Hertz).

Higher sample_rate values result in better audio quality, but also more bandwidth (and therefore, slower recognition). Additionally, some machines, such as some Raspberry Pi models, can't keep up if this value is too high.

Higher chunk_size values help avoid triggering on rapidly changing ambient noise, but they also make detection less sensitive. This value, generally, should be left at its default.

Instances of this class are context managers, and are designed to be used with with statements:
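For example, a minimal usage sketch (assuming the package is imported as sr and PyAudio is installed):

```python
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:   # open the default microphone
    audio = r.listen(source)      # record a single phrase as AudioData
```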

Microphone.list_microphone_names() -> List[str]

Returns a list of the names of all available microphones. For microphones where the name can't be retrieved, the list entry contains None instead.

The index of each microphone's name in the returned list is the same as its device index when creating a Microphone instance - if you want to use the microphone at index 3 in the returned list, use Microphone(device_index=3) .

To create a Microphone instance by name:
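For example, the following sketch picks the first device whose name contains a given substring; the "USB" fragment is just a placeholder for whatever device you are targeting:

```python
import speech_recognition as sr

names = sr.Microphone.list_microphone_names()
index = next(i for i, name in enumerate(names) if name and "USB" in name)
mic = sr.Microphone(device_index=index)
```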

Microphone.list_working_microphones() -> Dict[int, str]

Returns a dictionary mapping device indices to microphone names, for microphones that are currently hearing sounds. When using this function, ensure that your microphone is unmuted and make some noise at it to ensure it will be detected as working.

Each key in the returned dictionary can be passed to the Microphone constructor to use that microphone. For example, if the return value is {3: "HDA Intel PCH: ALC3232 Analog (hw:1,0)"} , you can do Microphone(device_index=3) to use that microphone.

To create a Microphone instance for the first working microphone:
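For example (remember to make some noise while the scan runs, as noted above):

```python
import speech_recognition as sr

working = sr.Microphone.list_working_microphones()
if not working:
    raise RuntimeError("No working microphones detected")
mic = sr.Microphone(device_index=next(iter(working)))  # first working device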

AudioFile(filename_or_fileobject: Union[str, io.IOBase]) -> AudioFile

Creates a new AudioFile instance given a WAV/AIFF/FLAC audio file filename_or_fileobject . Subclass of AudioSource .

If filename_or_fileobject is a string, then it is interpreted as a path to an audio file on the filesystem. Otherwise, filename_or_fileobject should be a file-like object such as io.BytesIO or similar.

Note that functions that read from the audio (such as recognizer_instance.record or recognizer_instance.listen ) will move ahead in the stream. For example, if you execute recognizer_instance.record(audiofile_instance, duration=10) twice, the first time it will return the first 10 seconds of audio, and the second time it will return the 10 seconds of audio right after that. This is always reset when entering the context with a context manager.

WAV files must be in PCM/LPCM format; WAVE_FORMAT_EXTENSIBLE and compressed WAV are not supported and may result in undefined behaviour.

Both AIFF and AIFF-C (compressed AIFF) formats are supported.

FLAC files must be in native FLAC format; OGG-FLAC is not supported and may result in undefined behaviour.

audiofile_instance.DURATION # type: float

Represents the length of the audio stored in the audio file in seconds. This property is only available when inside a context - essentially, that means it should only be accessed inside the body of a with audiofile_instance ... statement. Outside of contexts, this property is None .

This is useful when combined with the offset parameter of recognizer_instance.record , since when together it is possible to perform speech recognition in chunks.

However, note that recognizing speech in multiple chunks is not the same as recognizing the whole thing at once. If spoken words appear on the boundaries that we split the audio into chunks on, each chunk only gets part of the word, which may result in inaccurate results.
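As an illustration, the sketch below transcribes a long file in fixed-size chunks; "long_audio.wav" and the 30-second chunk length are placeholders, and, as noted above, words spanning a chunk boundary may be recognized incorrectly.

```python
import speech_recognition as sr

r = sr.Recognizer()
chunk_seconds = 30

with sr.AudioFile("long_audio.wav") as source:
    remaining = source.DURATION                           # only valid inside the context
    while remaining > 0:
        audio = r.record(source, duration=chunk_seconds)  # reads the next chunk
        try:
            print(r.recognize_google(audio))
        except sr.UnknownValueError:
            pass                                          # no intelligible speech in this chunk
        remaining -= chunk_seconds
```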

Recognizer() -> Recognizer

Creates a new Recognizer instance, which represents a collection of speech recognition settings and functionality.

recognizer_instance.energy_threshold = 300 # type: float

Represents the energy level threshold for sounds. Values below this threshold are considered silence, and values above this threshold are considered speech. Can be changed.

This is adjusted automatically if dynamic thresholds are enabled (see recognizer_instance.dynamic_energy_threshold ). A good starting value will generally allow the automatic adjustment to reach a good value faster.

This threshold is associated with the perceived loudness of the sound, but it is a nonlinear relationship. The actual energy threshold you will need depends on your microphone sensitivity or audio data. Typical values for a silent room are 0 to 100, and typical values for speaking are between 150 and 3500. Ambient (non-speaking) noise has a significant impact on what values will work best.

If you're having trouble with the recognizer trying to recognize words even when you're not speaking, try tweaking this to a higher value. If you're having trouble with the recognizer not recognizing your words when you are speaking, try tweaking this to a lower value. For example, a sensitive microphone or microphones in louder rooms might have an ambient energy level of up to 4000:
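```python
import speech_recognition as sr

r = sr.Recognizer()
r.energy_threshold = 4000   # start high for a sensitive microphone or a loud room
```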

The dynamic energy threshold setting can mitigate this by increasing or decreasing this automatically to account for ambient noise. However, this takes time to adjust, so it is still possible to get the false positive detections before the threshold settles into a good value.

To avoid this, use recognizer_instance.adjust_for_ambient_noise(source, duration = 1) to calibrate the level to a good value. Alternatively, simply set this property to a high value initially (4000 works well), so the threshold is always above ambient noise levels: over time, it will be automatically decreased to account for ambient noise levels.

recognizer_instance.dynamic_energy_threshold = True # type: bool

Represents whether the energy level threshold (see recognizer_instance.energy_threshold ) for sounds should be automatically adjusted based on the currently ambient noise level while listening. Can be changed.

Recommended for situations where the ambient noise level is unpredictable, which seems to be the majority of use cases. If the ambient noise level is strictly controlled, better results might be achieved by setting this to False to turn it off.

recognizer_instance.dynamic_energy_adjustment_damping = 0.15 # type: float

If the dynamic energy threshold setting is enabled (see recognizer_instance.dynamic_energy_threshold ), represents approximately the fraction of the current energy threshold that is retained after one second of dynamic threshold adjustment. Can be changed (not recommended).

Lower values allow for faster adjustment, but also make it more likely to miss certain phrases (especially those with slowly changing volume). This value should be between 0 and 1. As this value approaches 1, dynamic adjustment has less of an effect over time. When this value is 1, dynamic adjustment has no effect.

recognizer_instance.dynamic_energy_adjustment_ratio = 1.5 # type: float

If the dynamic energy threshold setting is enabled (see recognizer_instance.dynamic_energy_threshold ), represents the minimum factor by which speech is louder than ambient noise. Can be changed (not recommended).

For example, the default value of 1.5 means that speech is at least 1.5 times louder than ambient noise. Smaller values result in more false positives (but fewer false negatives) when ambient noise is loud compared to speech.

recognizer_instance.pause_threshold = 0.8 # type: float

Represents the minimum length of silence (in seconds) that will register as the end of a phrase. Can be changed.

Smaller values result in the recognition completing more quickly, but might result in slower speakers being cut off.

recognizer_instance.operation_timeout = None # type: Union[float, None]

Represents the timeout (in seconds) for internal operations, such as API requests. Can be changed.

Setting this to a reasonable value ensures that these operations will never block indefinitely, though good values depend on your network speed and the expected length of the audio to recognize.

recognizer_instance.record(source: AudioSource, duration: Union[float, None] = None, offset: Union[float, None] = None) -> AudioData

Records up to duration seconds of audio from source (an AudioSource instance) starting at offset (or at the beginning if not specified) into an AudioData instance, which it returns.

If duration is not specified, then it will record until there is no more audio input.

recognizer_instance.adjust_for_ambient_noise(source: AudioSource, duration: float = 1) -> None

Adjusts the energy threshold dynamically using audio from source (an AudioSource instance) to account for ambient noise.

Intended to calibrate the energy threshold with the ambient energy level. Should be used on periods of audio without speech - will stop early if any speech is detected.

The duration parameter is the maximum number of seconds that it will dynamically adjust the threshold for before returning. This value should be at least 0.5 in order to get a representative sample of the ambient noise.

recognizer_instance.listen(source: AudioSource, timeout: Union[float, None] = None, phrase_time_limit: Union[float, None] = None, snowboy_configuration: Union[Tuple[str, Iterable[str]], None] = None) -> AudioData

Records a single phrase from source (an AudioSource instance) into an AudioData instance, which it returns.

This is done by waiting until the audio has an energy above recognizer_instance.energy_threshold (the user has started speaking), and then recording until it encounters recognizer_instance.pause_threshold seconds of non-speaking or there is no more audio input. The ending silence is not included.

The timeout parameter is the maximum number of seconds that this will wait for a phrase to start before giving up and throwing an speech_recognition.WaitTimeoutError exception. If timeout is None , there will be no wait timeout.

The phrase_time_limit parameter is the maximum number of seconds that this will allow a phrase to continue before stopping and returning the part of the phrase processed before the time limit was reached. The resulting audio will be the phrase cut off at the time limit. If phrase_time_limit is None , there will be no phrase time limit.

The snowboy_configuration parameter allows integration with Snowboy , an offline, high-accuracy, power-efficient hotword recognition engine. When used, this function will pause until Snowboy detects a hotword, after which it will unpause. This parameter should either be None to turn off Snowboy support, or a tuple of the form (SNOWBOY_LOCATION, LIST_OF_HOT_WORD_FILES) , where SNOWBOY_LOCATION is the path to the Snowboy root directory, and LIST_OF_HOT_WORD_FILES is a list of paths to Snowboy hotword configuration files (*.pmdl or *.umdl format).

This operation will always complete within timeout + phrase_time_limit seconds if both are numbers, either by returning the audio data, or by raising a speech_recognition.WaitTimeoutError exception.

recognizer_instance.listen_in_background(source: AudioSource, callback: Callable[[Recognizer, AudioData], Any]) -> Callable[bool, None]

Spawns a thread to repeatedly record phrases from source (an AudioSource instance) into an AudioData instance and call callback with that AudioData instance as soon as each phrase is detected.

Returns a function object that, when called, requests that the background listener thread stop. The background thread is a daemon and will not stop the program from exiting if there are no other non-daemon threads. The function accepts one parameter, wait_for_stop : if truthy, the function will wait for the background listener to stop before returning, otherwise it will return immediately and the background listener thread might still be running for a second or two afterwards. Additionally, if you are using a truthy value for wait_for_stop , you must call the function from the same thread you originally called listen_in_background from.

Phrase recognition uses the exact same mechanism as recognizer_instance.listen(source) . The phrase_time_limit parameter works in the same way as the phrase_time_limit parameter for recognizer_instance.listen(source) , as well.

The callback parameter is a function that should accept two parameters - the recognizer_instance , and an AudioData instance representing the captured audio. Note that callback function will be called from a non-main thread.
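A typical usage sketch (recognize_google here is just one possible recognizer to call from the callback):

```python
import time
import speech_recognition as sr

r = sr.Recognizer()
mic = sr.Microphone()

def callback(recognizer, audio):
    # Runs on a background (non-main) thread for every detected phrase.
    try:
        print("Heard:", recognizer.recognize_google(audio))
    except sr.UnknownValueError:
        print("Could not understand the phrase")
    except sr.RequestError as e:
        print("Recognition request failed:", e)

with mic as source:
    r.adjust_for_ambient_noise(source)        # calibrate once before listening

stop_listening = r.listen_in_background(mic, callback)
time.sleep(15)                                # keep the main thread alive while listening
stop_listening(wait_for_stop=False)           # ask the background thread to stop
```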

recognizer_instance.recognize_sphinx(audio_data: AudioData, language: str = "en-US", keyword_entries: Union[Iterable[Tuple[str, float]], None] = None, grammar: Union[str, None] = None, show_all: bool = False) -> Union[str, pocketsphinx.pocketsphinx.Decoder]

Performs speech recognition on audio_data (an AudioData instance), using CMU Sphinx.

The recognition language is determined by language , an RFC5646 language tag like "en-US" or "en-GB" , defaulting to US English. Out of the box, only en-US is supported. See Notes on using PocketSphinx for information about installing other languages. This document is also included under reference/pocketsphinx.rst . The language parameter can also be a tuple of filesystem paths, of the form (acoustic_parameters_directory, language_model_file, phoneme_dictionary_file) - this allows you to load arbitrary Sphinx models.

If specified, the keywords to search for are determined by keyword_entries , an iterable of tuples of the form (keyword, sensitivity) , where keyword is a phrase, and sensitivity is how sensitive to this phrase the recognizer should be, on a scale of 0 (very insensitive, more false negatives) to 1 (very sensitive, more false positives) inclusive. If not specified or None , no keywords are used and Sphinx will simply transcribe whatever words it recognizes. Specifying keyword_entries is more accurate than just looking for those same keywords in non-keyword-based transcriptions, because Sphinx knows specifically what sounds to look for.

Sphinx can also handle FSG or JSGF grammars. The parameter grammar expects a path to the grammar file. Note that if a JSGF grammar is passed, an FSG grammar will be created at the same location to speed up execution in the next run. If keyword_entries are passed, content of grammar will be ignored.

Returns the most likely transcription if show_all is false (the default). Otherwise, returns the Sphinx pocketsphinx.pocketsphinx.Decoder object resulting from the recognition.

Raises a speech_recognition.UnknownValueError exception if the speech is unintelligible. Raises a speech_recognition.RequestError exception if there are any issues with the Sphinx installation.

recognizer_instance.recognize_google(audio_data: AudioData, key: Union[str, None] = None, language: str = "en-US", pfilter: Union[0, 1] = 0, show_all: bool = False) -> Union[str, Dict[str, Any]]

Performs speech recognition on audio_data (an AudioData instance), using the Google Speech Recognition API.

The Google Speech Recognition API key is specified by key . If not specified, it uses a generic key that works out of the box. This should generally be used for personal or testing purposes only, as it may be revoked by Google at any time .

To obtain your own API key, simply follow the steps on the API Keys page at the Chromium Developers site. In the Google Developers Console, Google Speech Recognition is listed as "Speech API". Note that the API quota for your own keys is 50 requests per day , and there is currently no way to raise this limit.

The recognition language is determined by language , an IETF language tag like "en-US" or "en-GB" , defaulting to US English. A list of supported language tags can be found here . Basically, language codes can be just the language ( en ), or a language with a dialect ( en-US ).

The profanity filter level can be adjusted with pfilter : 0 - No filter, 1 - Only shows the first character and replaces the rest with asterisks. The default is level 0.

Returns the most likely transcription if show_all is false (the default). Otherwise, returns the raw API response as a JSON dictionary.

Raises a speech_recognition.UnknownValueError exception if the speech is unintelligible. Raises a speech_recognition.RequestError exception if the speech recognition operation failed, if the key isn't valid, or if there is no internet connection.

recognizer_instance.recognize_google_cloud(audio_data: AudioData, credentials_json: Union[str, None] = None, language: str = "en-US", preferred_phrases: Union[Iterable[str], None] = None, show_all: bool = False) -> Union[str, Dict[str, Any]]

Performs speech recognition on audio_data (an AudioData instance), using the Google Cloud Speech API.

This function requires a Google Cloud Platform account; see the Google Cloud Speech API Quickstart for details and instructions. Basically, create a project, enable billing for the project, enable the Google Cloud Speech API for the project, and set up Service Account Key credentials for the project. The result is a JSON file containing the API credentials. The text content of this JSON file is specified by credentials_json . If not specified, the library will try to automatically find the default API credentials JSON file .

The recognition language is determined by language , which is a BCP-47 language tag like "en-US" (US English). A list of supported language tags can be found in the Google Cloud Speech API documentation .

If preferred_phrases is an iterable of phrase strings, those given phrases will be more likely to be recognized over similar-sounding alternatives. This is useful for things like keyword/command recognition or adding new phrases that aren't in Google's vocabulary. Note that the API imposes certain restrictions on the list of phrase strings .

Returns the most likely transcription if show_all is False (the default). Otherwise, returns the raw API response as a JSON dictionary.

Raises a speech_recognition.UnknownValueError exception if the speech is unintelligible. Raises a speech_recognition.RequestError exception if the speech recognition operation failed, if the credentials aren't valid, or if there is no Internet connection.

recognizer_instance.recognize_wit(audio_data: AudioData, key: str, show_all: bool = False) -> Union[str, Dict[str, Any]]

Performs speech recognition on audio_data (an AudioData instance), using the Wit.ai API.

The Wit.ai API key is specified by key . Unfortunately, these are not available without signing up for an account and creating an app. You will need to add at least one intent to the app before you can see the API key, though the actual intent settings don't matter.

To get the API key for a Wit.ai app, go to the app's overview page, go to the section titled "Make an API request", and look for something along the lines of Authorization: Bearer XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX ; XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX is the API key. Wit.ai API keys are 32-character uppercase alphanumeric strings.

The recognition language is configured in the Wit.ai app settings.

recognizer_instance.recognize_bing(audio_data: AudioData, key: str, language: str = "en-US", show_all: bool = False) -> Union[str, Dict[str, Any]]

Performs speech recognition on audio_data (an AudioData instance), using the Microsoft Bing Speech API.

The Microsoft Bing Speech API key is specified by key . Unfortunately, these are not available without signing up for an account with Microsoft Azure.

To get the API key, go to the Microsoft Azure Portal Resources page, go to "All Resources" > "Add" > "See All" > Search "Bing Speech API > "Create", and fill in the form to make a "Bing Speech API" resource. On the resulting page (which is also accessible from the "All Resources" page in the Azure Portal), go to the "Show Access Keys" page, which will have two API keys, either of which can be used for the key parameter. Microsoft Bing Speech API keys are 32-character lowercase hexadecimal strings.

The recognition language is determined by language , a BCP-47 language tag like "en-US" (US English) or "fr-FR" (International French), defaulting to US English. A list of supported language values can be found in the API documentation under "Interactive and dictation mode".

recognizer_instance.recognize_houndify(audio_data: AudioData, client_id: str, client_key: str, show_all: bool = False) -> Union[str, Dict[str, Any]]

Performs speech recognition on audio_data (an AudioData instance), using the Houndify API.

The Houndify client ID and client key are specified by client_id and client_key , respectively. Unfortunately, these are not available without signing up for an account . Once logged into the dashboard , you will want to select "Register a new client", and fill in the form as necessary. When at the "Enable Domains" page, enable the "Speech To Text Only" domain, and then select "Save & Continue".

To get the client ID and client key for a Houndify client, go to the dashboard and select the client's "View Details" link. On the resulting page, the client ID and client key will be visible. Client IDs and client keys are both Base64-encoded strings.

Currently, only English is supported as a recognition language.

recognizer_instance.recognize_ibm(audio_data: AudioData, username: str, password: str, language: str = "en-US", show_all: bool = False) -> Union[str, Dict[str, Any]]

Performs speech recognition on audio_data (an AudioData instance), using the IBM Speech to Text API.

The IBM Speech to Text username and password are specified by username and password , respectively. Unfortunately, these are not available without signing up for an account . Once logged into the Bluemix console, follow the instructions for creating an IBM Watson service instance , where the Watson service is "Speech To Text". IBM Speech to Text usernames are strings of the form XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX, while passwords are mixed-case alphanumeric strings.

The recognition language is determined by language , an RFC5646 language tag with a dialect like "en-US" (US English) or "zh-CN" (Mandarin Chinese), defaulting to US English. The supported language values are listed under the model parameter of the audio recognition API documentation , in the form LANGUAGE_BroadbandModel , where LANGUAGE is the language value.

recognizer_instance.recognize_whisper(audio_data: AudioData, model: str="base", show_dict: bool=False, load_options: Dict[Any, Any]=None, language:Optional[str]=None, translate:bool=False, **transcribe_options):

Performs speech recognition on audio_data (an AudioData instance), using Whisper.

The recognition language is determined by language , an uncapitalized full language name like "english" or "chinese". See the full language list at https://github.com/openai/whisper/blob/main/whisper/tokenizer.py

model can be any of tiny, base, small, medium, large, tiny.en, base.en, small.en, medium.en. See https://github.com/openai/whisper for more details.

If show_dict is true, returns the full dict response from Whisper, including the detected language. Otherwise returns only the transcription.

You can translate the result to English with Whisper by passing translate=True .

Other values are passed directly to Whisper. See https://github.com/openai/whisper/blob/main/whisper/transcribe.py for all options.
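A short usage sketch (requires the openai-whisper package to be installed; "speech.wav" is a placeholder path):

```python
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("speech.wav") as source:
    audio = r.record(source)

# Local Whisper transcription; pass translate=True to get an English translation instead.
print(r.recognize_whisper(audio, model="base", language="english"))
```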

recognizer_instance.recognize_whisper_api(audio_data: AudioData, model: str = "whisper-1", api_key: str | None = None)

Performs speech recognition on audio_data (an AudioData instance), using the OpenAI Whisper API.

This function requires an OpenAI account; visit https://platform.openai.com/signup , then generate API Key in User settings .

Detail: https://platform.openai.com/docs/guides/speech-to-text

Raises a speech_recognition.exceptions.SetupError exception if there are any issues with the openai installation, or the environment variable is missing.

AudioSource

Base class representing audio sources. Do not instantiate.

Instances of subclasses of this class, such as Microphone and AudioFile , can be passed to things like recognizer_instance.record and recognizer_instance.listen . Those instances act like context managers, and are designed to be used with with statements.

For more information, see the documentation for the individual subclasses.

AudioData(frame_data: bytes, sample_rate: int, sample_width: int) -> AudioData

Creates a new AudioData instance, which represents mono audio data.

The raw audio data is specified by frame_data , which is a sequence of bytes representing audio samples. This is the frame data structure used by the PCM WAV format.

The width of each sample, in bytes, is specified by sample_width . Each group of sample_width bytes represents a single audio sample.

The audio data is assumed to have a sample rate of sample_rate samples per second (Hertz).

Usually, instances of this class are obtained from recognizer_instance.record or recognizer_instance.listen , or in the callback for recognizer_instance.listen_in_background , rather than instantiating them directly.

audiodata_instance.get_segment(start_ms: Union[float, None] = None, end_ms: Union[float, None] = None) -> AudioData

Returns a new AudioData instance, trimmed to a given time interval. In other words, an AudioData instance with the same audio data except starting at start_ms milliseconds in and ending end_ms milliseconds in.

If not specified, start_ms defaults to the beginning of the audio, and end_ms defaults to the end.

audiodata_instance.get_raw_data(convert_rate: Union[int, None] = None, convert_width: Union[int, None] = None) -> bytes

Returns a byte string representing the raw frame data for the audio represented by the AudioData instance.

If convert_rate is specified and the audio sample rate is not convert_rate Hz, the resulting audio is resampled to match.

If convert_width is specified and the audio samples are not convert_width bytes each, the resulting audio is converted to match.

Writing these bytes directly to a file results in a valid RAW/PCM audio file .

audiodata_instance.get_wav_data(convert_rate: Union[int, None] = None, convert_width: Union[int, None] = None) -> bytes

Returns a byte string representing the contents of a WAV file containing the audio represented by the AudioData instance.

Writing these bytes directly to a file results in a valid WAV file .

audiodata_instance.get_aiff_data(convert_rate: Union[int, None] = None, convert_width: Union[int, None] = None) -> bytes

Returns a byte string representing the contents of an AIFF-C file containing the audio represented by the AudioData instance.

Writing these bytes directly to a file results in a valid AIFF-C file .

audiodata_instance.get_flac_data(convert_rate: Union[int, None] = None, convert_width: Union[int, None] = None) -> bytes

Returns a byte string representing the contents of a FLAC file containing the audio represented by the AudioData instance.

Note that 32-bit FLAC is not supported. If the audio data is 32-bit and convert_width is not specified, then the resulting FLAC will be a 24-bit FLAC.

Writing these bytes directly to a file results in a valid FLAC file .


Class RecognitionConfig (2.27.0)


Provides information to the recognizer that specifies how to process the request.


Encoding of audio data sent in all messages. This field is optional for FLAC and WAV audio files and required for all other audio formats. For details, see AudioEncoding.

Sample rate in Hertz of the audio data sent in all messages. Valid values are: 8000-48000. 16000 is optimal. For best results, set the sampling rate of the audio source to 16000 Hz. If that's not possible, use the native sample rate of the audio source (instead of re-sampling). This field is optional for FLAC and WAV audio files, but is required for all other audio formats. For details, see AudioEncoding.

The number of channels in the input audio data. ONLY set this for MULTI-CHANNEL recognition. Valid values for LINEAR16, OGG_OPUS and FLAC are 1-8. The only valid value for MULAW, AMR, AMR_WB and SPEEX_WITH_HEADER_BYTE is 1. If 0 or omitted, defaults to one channel (mono). Note: We only recognize the first channel by default. To perform independent recognition on each channel, set enable_separate_recognition_per_channel to 'true'.

This needs to be set to 'true' explicitly, with audio_channel_count > 1, to get each channel recognized separately. The recognition result will contain a channel_tag field to state which channel that result belongs to. If this is not true, we will only recognize the first channel. The request is billed cumulatively for all channels recognized: audio_channel_count multiplied by the length of the audio.

Required. The language of the supplied audio as a BCP-47 language tag. Example: "en-US". See Language Support for a list of the currently supported language codes.

A list of up to 3 additional BCP-47 language tags, listing possible alternative languages of the supplied audio. See Language Support for a list of the currently supported language codes.

Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of SpeechRecognitionAlternative messages within each SpeechRecognitionResult. The server may return fewer than this number. Valid values are 0-30. A value of 0 or 1 will return a maximum of one. If omitted, will return a maximum of one.

If set to true, the server will attempt to filter out profanities, replacing all but the initial character in each filtered word with asterisks, e.g. "f***". If set to false or omitted, profanities won't be filtered out.

Speech adaptation configuration improves the accuracy of speech recognition. For more information, see the speech adaptation documentation.

Use transcription normalization to automatically replace parts of the transcript with phrases of your choosing. For StreamingRecognize, this normalization only applies to stable partial transcripts (stability > 0.8) and final transcripts.
Array of SpeechContext. A means to provide context to assist the speech recognition. For more information, see the speech adaptation documentation.

If true, the top result includes a list of words and the start and end time offsets (timestamps) for those words. If false, no word-level time offset information is returned. The default is false.

If true, the top result includes a list of words and the confidence for those words. If false, no word-level confidence information is returned. The default is false.

If 'true', adds punctuation to recognition result hypotheses. This feature is only available in select languages. Setting this for requests in other languages has no effect at all. The default 'false' value does not add punctuation to result hypotheses.

The spoken punctuation behavior for the call. If not set, uses the default behavior based on the model of choice; e.g. command_and_search will enable spoken punctuation by default. If 'true', replaces spoken punctuation with the corresponding symbols in the request. For example, "how are you question mark" becomes "how are you?". See https://cloud.google.com/speech-to-text/docs/spoken-punctuation for support. If 'false', spoken punctuation is not replaced.

The spoken emoji behavior for the call. If not set, uses the default behavior based on the model of choice. If 'true', adds spoken emoji formatting for the request. This will replace spoken emojis with the corresponding Unicode symbols in the final transcript. If 'false', spoken emojis are not replaced.

If 'true', enables speaker detection for each recognized word in the top alternative of the recognition result using a speaker_tag provided in the WordInfo. Note: Use diarization_config instead.

If set, specifies the estimated number of speakers in the conversation. Defaults to '2'. Ignored unless enable_speaker_diarization is set to true. Note: Use diarization_config instead.

Config to enable speaker diarization and set additional parameters to make diarization better suited for your application. Note: When this is enabled, we send all the words from the beginning of the audio for the top alternative in every consecutive STREAMING responses. This is done in order to improve our speaker tags as our models learn to identify the speakers in the conversation over time. For non-streaming requests, the diarization results will be provided only in the top alternative of the FINAL SpeechRecognitionResult.

Metadata regarding this request.

Which model to select for the given request. Select the model best suited to your domain to get best results. If a model is not explicitly specified, then we auto-select a model based on the parameters in the RecognitionConfig (a short configuration sketch follows this list):

  • latest_long : Best for long form content like media or conversation.
  • latest_short : Best for short form content like commands or single shot directed speech.
  • command_and_search : Best for short queries such as voice commands or voice search.
  • phone_call : Best for audio that originated from a phone call (typically recorded at an 8khz sampling rate).
  • video : Best for audio that originated from video or includes multiple speakers. Ideally the audio is recorded at a 16khz or greater sampling rate. This is a premium model that costs more than the standard rate.
  • default : Best for audio that is not one of the specific audio models. For example, long-form audio. Ideally the audio is high-fidelity, recorded at a 16khz or greater sampling rate.
  • medical_conversation : Best for audio that originated from a conversation between a medical provider and patient.
  • medical_dictation : Best for audio that originated from dictation notes by a medical provider.
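A minimal sketch of selecting a model, assuming the google-cloud-speech Python client; the 8 kHz rate simply matches the phone_call description above.

    from google.cloud import speech

    # Telephony audio is typically 8 kHz, so pair the phone_call model with that rate.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code="en-US",
        model="phone_call",  # or "latest_short", "video", "default", ...
    )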

AudioEncoding

The encoding of the audio data sent in the request.

All encodings support only 1 channel (mono) audio, unless the audio_channel_count and enable_separate_recognition_per_channel fields are set.

For best results, the audio source should be captured and transmitted using a lossless encoding ( FLAC or LINEAR16 ). The accuracy of the speech recognition can be reduced if lossy codecs are used to capture or transmit audio, particularly if background noise is present. Lossy codecs include MULAW , AMR , AMR_WB , OGG_OPUS , SPEEX_WITH_HEADER_BYTE , MP3 , and WEBM_OPUS .

The FLAC and WAV audio file formats include a header that describes the included audio content. You can request recognition for WAV files that contain either LINEAR16 or MULAW encoded audio. If you send FLAC or WAV audio file format in your request, you do not need to specify an AudioEncoding ; the audio encoding format is determined from the file header. If you specify an AudioEncoding when you send FLAC or WAV audio, the encoding configuration must match the encoding described in the audio header; otherwise the request returns a google.rpc.Code.INVALID_ARGUMENT error code.
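A hedged sketch of relying on the WAV header, assuming the google-cloud-speech Python client and a local placeholder file named command.wav.

    from google.cloud import speech

    client = speech.SpeechClient()

    # For WAV (LINEAR16 or MULAW) and FLAC files, the encoding and sample rate are
    # read from the file header, so they can be omitted from the config.
    config = speech.RecognitionConfig(language_code="en-US")

    with open("command.wav", "rb") as f:  # placeholder local file
        audio = speech.RecognitionAudio(content=f.read())

    response = client.recognize(config=config, audio=audio)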

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License , and code samples are licensed under the Apache 2.0 License . For details, see the Google Developers Site Policies . Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-08-21 UTC.
