Introduction
In this tutorial, we'll walk through creating a sophisticated Text-to-Speech (TTS) application using the Hugging Face Transformers library. Our application will leverage the powerful "bark" model from Suno to generate high-quality audio from text input. We'll also create an intuitive user interface using IPython widgets, making it easy for users to interact with our TTS system.
Prerequisites
Before we begin, make sure you have the following libraries installed:
transformers
scipy
ipywidgets
torch (with CUDA support for GPU acceleration)
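If any of these are missing, you can install them from a notebook cell. The right torch build depends on your CUDA version, so check the PyTorch installation guide if the default wheel doesn't match your setup:
%pip install transformers scipy ipywidgets torch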
Key Features of Our TTS Application
Text-to-Speech Conversion: Convert any text input into natural-sounding speech.
Multiple Voice Options: Choose from different speaker presets for varied vocal characteristics.
Non-Speech Sound Insertion: Easily add laughter, sighs, music, and other non-speech sounds.
Speaker Bias: Influence the speech generation with male or female speaker biases.
Custom Output: Specify custom filenames for the generated audio files.
User-Friendly Interface: A sleek, modern UI design for easy interaction.
Implementation
We will load the model and processor using the transformers library:
from transformers import AutoProcessor, AutoModel
import scipy.io.wavfile
import ipywidgets as widgets
from IPython.display import display, HTML
# Load the Bark model and its processor
processor = AutoProcessor.from_pretrained("suno/bark")
model = AutoModel.from_pretrained("suno/bark")
model.to('cuda')
The app uses the AutoProcessor and AutoModel classes from the Transformers library. The TTS model, "suno/bark," is loaded onto a CUDA-enabled device for efficient processing.
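The hard-coded 'cuda' assumes a GPU is present. If the notebook might also run on CPU-only machines, a small fallback keeps it portable (a minimal sketch; CPU generation is much slower):
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
Code later in the notebook can then refer to model.device rather than hard-coding 'cuda'.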
Designing the User Interface
We will create a UI that is both functional and aesthetically pleasing, using ipywidgets for the interactive elements and custom CSS for styling.
Defining Colors and Styles
To create a cohesive look, we define a color palette and custom styles:
colors = {
    'background': '#1A1A2E',
    'primary': '#16213E',
    'secondary': '#0F3460',
    'accent': '#E94E77',
    'text': '#FFA07A',
    'header_text': '#4d4dff'
}
header_style = {
    'color': colors['header_text'],
    'font-size': '32px',
    'font-weight': 'bold',
    'text-align': 'center',
    'margin-bottom': '30px',
    'font-family': '"Segoe UI", Arial, sans-serif',
    'text-shadow': '2px 2px 4px rgba(0,0,0,0.3)'
}
# Keys must be valid LabelStyle traits (ipywidgets 8); spacing is handled by
# the .widget-label rule in the custom CSS below
label_style = {
    'font_weight': 'bold',
    'text_color': colors['text'],
    'font_family': '"Segoe UI", Arial, sans-serif',
    'font_size': '16px'
}
# widgets.Layout only understands CSS layout properties (width, padding,
# border, margin, ...); the colors, fonts, and shadows for the inputs live
# in the custom CSS block below
input_style = {
    'width': '100%',
    'padding': '12px',
    'border': 'none'
}
# Custom CSS for additional styling
custom_css = f"""
<style>
body {{
background-color: {colors['primary']};
}}
.widget-label {{
font-weight: bold;
margin-bottom: 10px;
margin-top: 30px;
color: {colors['text']};
font-family: "Segoe UI", Arial, sans-serif;
font-size: 16px;
}}
.widget-button:hover {{
background-color: #E87A8D !important;
transform: translateY(-2px);
transition: all 0.3s ease;
}}
</style>
"""
Creating Input Elements
Text Input: We start with a Textarea for the speech text; the dropdowns for voice presets and non-speech sounds and the output-filename field follow the same pattern.
text_input = widgets.Textarea(
    placeholder='Type your text here...',
    layout=widgets.Layout(height='150px', **input_style),
    style={'description_width': '0px'}
)
Voice Preset Dropdown: A Dropdown widget provides options to select different voice presets, ensuring the speech matches your desired tone.
options: A list of tuples, where each tuple contains a label (e.g., 'Speaker 9') and a corresponding value (e.g., 'v2/en_speaker_9') for the dropdown.
value: Sets the default selected option.
layout: Applies custom layout styling.
style: Ensures consistency with other widgets by setting the description width to 0px.
preset_dropdown = widgets.Dropdown(
    options=[
        ('Speaker 9', 'v2/en_speaker_9'),
        ('Speaker 0', 'v2/en_speaker_0'),
        ('Speaker 1', 'v2/en_speaker_1'),
        ('Speaker 2', 'v2/en_speaker_2')
    ],
    value='v2/en_speaker_9',
    layout=widgets.Layout(**input_style),
    style={'description_width': '0px'}
)
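Bark's presets follow a predictable naming scheme (v2/en_speaker_0 through v2/en_speaker_9 for English, with the same pattern for other languages), so if you want all ten English voices you can build the options list programmatically instead of writing it out:
preset_dropdown.options = [(f'Speaker {i}', f'v2/en_speaker_{i}') for i in range(10)]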
Non-Speech Elements Dropdown: Another Dropdown allows users to insert non-speech sounds like laughter or music into the text, enhancing the expressiveness of the output.
options: A list of available non-speech elements.
value: Initially set to None, meaning no selection is made by default.
layout: Applies a custom layout style similar to the other inputs.
style: Adjusts the description width to align with the dropdown content.
non_speech_dropdown = widgets.Dropdown(
    options=[
        ('Laughter', '[laughter]'),
        ('Laughs', '[laughs]'),
        ('Sighs', '[sighs]'),
        ('Music', '[music]'),
        ('Gasps', '[gasps]'),
        ('Clears Throat', '[clears throat]'),
        ('Hesitation', '— or ...'),
        ('Song Lyrics', '♪ for song lyrics ♪'),
        ('Emphasis', 'CAPITALIZATION for emphasis'),
        ('Man Bias', 'MAN:'),
        ('Woman Bias', 'WOMAN:')
    ],
    value=None,
    layout=widgets.Layout(**input_style),
    style={'description_width': '100px'}
)
Output Filename Input: A Text widget lets users specify the output file name for the generated audio.
value: The default filename for the generated audio.
placeholder: Shows what a typical filename should look like.
layout: Maintains consistent layout styling with the other input fields.
style: Ensures that no additional space is taken up by a label description.
output_filename = widgets.Text(
    value='output.wav',
    placeholder='output.wav',
    layout=widgets.Layout(**input_style),
    style={'description_width': '0px'}
)
Functionality to Insert Non-Speech Sounds
We include functionality to allow users to easily insert non-speech sounds into the text area.
def insert_non_speech(change):
    if change['new']:
        text_input.value += ' ' + change['new']
        non_speech_dropdown.value = None  # Reset the dropdown for the next pick
non_speech_dropdown.observe(insert_non_speech, names='value')
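Trait notifications fire synchronously in the kernel, so you can sanity-check the handler straight from a cell (purely illustrative):
text_input.value = 'Hello there'
non_speech_dropdown.value = '[laughter]'  # fires insert_non_speech
print(text_input.value)           # Hello there [laughter]
print(non_speech_dropdown.value)  # None; the handler resets the dropdown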
Adding a Generate Button
A button is provided to generate the audio, and we define the function to process and save the audio file.
generate_button = widgets.Button(
    description='Generate Audio',
    tooltip='Click to generate audio',
    icon='play',
    # Colors and fonts go through ButtonStyle; Layout only accepts layout properties
    style=widgets.ButtonStyle(
        button_color=colors['accent'],
        text_color=colors['text'],
        font_weight='bold'
    ),
    layout=widgets.Layout(
        width='70%',
        height='50px',
        padding='0px',
        margin='30px auto',
        display='flex',
        justify_content='center',
        align_items='center'
    )
)
Generate Audio: The generate_button is a key component that triggers the text-to-speech conversion when clicked. The button is styled with a vibrant color and dynamic effects, ensuring a modern look.
def generate_audio(text, preset, output):
    # Tokenize the text and attach the selected voice preset
    inputs = processor(text, voice_preset=preset)
    # Move every input to the model's device instead of hard-coding 'cuda'
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    audio_array = model.generate(**inputs)
    audio_array = audio_array.cpu().numpy().squeeze()
    # Bark exposes its output sample rate on the generation config
    sample_rate = model.generation_config.sample_rate
    scipy.io.wavfile.write(output, rate=sample_rate, data=audio_array)
    print(f"Audio saved as {output}")
generate_button.on_click(lambda b: generate_audio(text_input.value, preset_dropdown.value, output_filename.value))
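In some notebook front ends, print output from a widget callback never reaches the screen. If the confirmation message doesn't appear for you, an Output widget captures it reliably and can render an inline player as well; this optional sketch replaces the on_click registration above:
from IPython.display import Audio
status = widgets.Output()
def on_generate(b):
    with status:
        status.clear_output()
        generate_audio(text_input.value, preset_dropdown.value, output_filename.value)
        display(Audio(output_filename.value))  # inline playback of the new file
generate_button.on_click(on_generate)
display(status)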
Assembling the Interface
Finally, we assemble all the components into a single container and display the UI.
container = widgets.VBox([
    widgets.HTML(value="<div style='{}'>🎧 Audio Generator</div>".format(';'.join([f"{k}: {v}" for k, v in header_style.items()]))),
    widgets.Label('Enter Text:', style=label_style),
    text_input,
    widgets.Label('Insert Non-Speech:', style=label_style),
    non_speech_dropdown,
    widgets.Label('Select Voice:', style=label_style),
    preset_dropdown,
    widgets.Label('Output Filename:', style=label_style),
    output_filename,
    generate_button
], layout=widgets.Layout(
    align_items='stretch',
    padding='40px',
    border='none',
    width='500px',
    margin='0 auto'
))
# The card background, rounded corners, and glow come from the .tts-container CSS rule
container.add_class('tts-container')
display(HTML(custom_css))
display(container)
Pros & Cons
With the implementation complete, let's weigh the pros and cons of this approach.
Pros
Multiple Voice Options: The model offers various speaker presets, allowing for diverse vocal characteristics. This flexibility makes it adaptable to different use cases, such as narrations, voiceovers, or personalized assistants.
Non-Speech Sound Integration: The ability to insert non-speech sounds like laughter, sighs, or music enhances the expressiveness of the generated speech, making the output more dynamic and engaging.
Customizable Bias: The model allows users to influence the speech generation process with male or female speaker biases, providing greater control over the tone and style of the output.
Cons
Resource-Intensive: The model requires significant computational resources, especially for long texts or near-real-time use. This can be a limitation for users without access to GPUs or cloud resources.
Latency: Even with GPU acceleration, generation runs slower than real time, so users should expect a wait of several seconds or more per request. The smaller suno/bark-small checkpoint trades some audio quality for speed.
Time Limitations: Bark generates at most about 13 seconds of audio per call, so longer texts have to be split into chunks and stitched back together, as in the sketch below.
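A common workaround for the length cap is to split the text into sentence-sized chunks, generate each chunk, and concatenate the waveforms. Here is a minimal sketch; the naive split on '.' is only for illustration, and a real application would use a proper sentence tokenizer:
import numpy as np
def generate_long_audio(text, preset, output):
    # Generate one chunk per sentence and stitch the waveforms together
    pieces = []
    for sentence in (s.strip() for s in text.split('.') if s.strip()):
        inputs = processor(sentence, voice_preset=preset)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        pieces.append(model.generate(**inputs).cpu().numpy().squeeze())
    sample_rate = model.generation_config.sample_rate
    scipy.io.wavfile.write(output, rate=sample_rate, data=np.concatenate(pieces))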
If you need assistance with your Machine Learning projects, please don't hesitate to contact us. Our team of experienced developers specializes in Machine Learning and can provide the support and expertise needed to make your project a success. You can reach us through our website, or directly by email or phone.