Adding Emotion to Text to Speech Output with the W3C Emotion Markup Language
A hands-on 45 minute workshop at the 2013 SpeechTEK Conference.
Latest change: 1st March 2013
Description
Emotional state is an important component of interactions between humans, and the manner and tone of speech is one of the most common means by which humans convey emotions to each other. Text-to-speech (TTS) output would be much more natural and lifelike if developers could easily configure TTS systems to express emotions.

There has in fact been considerable research on expressive TTS, but most existing systems use proprietary approaches for adding the expression of emotion, making it difficult to integrate different products into larger systems. The World Wide Web Consortium (W3C) has recently published the Emotion Markup Language (EmotionML), a standard that can be used for annotating text with emotion for TTS, as well as for annotating other system outputs such as facial expressions.

This workshop will use the open-source, platform-independent Mary TTS system from DFKI to introduce the concepts of EmotionML. Participants will use Mary and EmotionML to create and save their own expressive short narratives.
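For illustration, a minimal EmotionML document annotating a sentence with a category from the W3C "big6" emotion vocabulary might look like the following sketch. It is based on the EmotionML 1.0 specification; note that placing plain text inside an emotion element is the convention used by Mary's EmotionML input type, and the vocabularies a given voice actually supports may vary.

```xml
<emotionml version="1.0"
           xmlns="http://www.w3.org/2009/10/emotionml"
           category-set="http://www.w3.org/TR/emotion-voc/xml#big6">
  <emotion>
    <category name="happiness"/>
    What a wonderful surprise to see you here!
  </emotion>
</emotionml>
```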
Workshop goal
This workshop will:
- give a short overview of emotional speech synthesis,
- introduce the W3C's EmotionML, and
- enable you to gain first hands-on experience with EmotionML in general and with emotional speech synthesis in particular.
Required background knowledge to complete the project
None.
There is some background knowledge and samples on emotional synthetic speech at http://emosamples.syntheticspeech.de/
Summary of the project which the attendee will build during the workshop
The participants will get a famous text, such as the first few lines of the "To be, or not to be" soliloquy from Hamlet, and we will hold a small contest to see who can come up with the best version using EmotionML and the Mary synthesizer.
Workshop requirements
- Hardware requirements
A laptop computer with reasonable resources should suffice.
Please test the software before the workshop.
An internet connection is not required during the workshop itself, but is needed for the installation.
- What software should be downloaded prior to the workshop?
The Mary Text-to-Speech software can be downloaded here:
http://mary.dfki.de/Download
- Instructions for downloading and installing the software
- Download Mary version 5.0.
- Extract the zip or tar file.
- Execute "marytts-component-installer" in the "bin" directory (it is a Java-based tool). Use the ".bat" version on Windows or the ".sh" version on a Unix-based system.
- Choose at least the language "en-US" and the voice "cmu-slt-hsmm".
- If you have at least 1 GB of disk space left, also add the language "de" and the voice "dfki-pavoque-styles". This installs support and data for German. Because a non-uniform unit-selection database with emotional units exists for German, the emotional output sounds much more natural than with the other languages.
- Click "install".
- Instructions for testing the installed software to verify that it has been installed correctly
- Start the "marytts-server" (.bat/sh) in the "bin" folder and wait until it's running.
- Start the "marytts-client" (.bat/sh) in the "bin" folder.
- Select the "cmu-slt-hsmm" voice from the dropdown menu at the bottom.
- Set "EmotionML" as input type in the dropdown menu top left.
- Set "Audio" as the output type in the dropdown menu at the top right.
- Adjust the output volume of your speakers and press the "Play" button. You should hear the Mary TTS output.
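Besides the graphical client, the running Mary server can also be driven over HTTP. The sketch below builds a request URL for synthesizing EmotionML input as WAV audio. The port (59125), the "/process" path, and the parameter names follow the MaryTTS 5.0 HTTP interface as we understand it; treat them as assumptions and check the documentation page your local server serves.

```python
from urllib.parse import urlencode

# Default address of the MaryTTS 5.0 HTTP server (an assumption; the port
# can be changed in the server configuration).
MARY_URL = "http://localhost:59125/process"

def build_mary_request(emotionml: str,
                       voice: str = "cmu-slt-hsmm",
                       locale: str = "en_US") -> str:
    """Return a GET URL asking Mary to render EmotionML input as WAV audio."""
    params = {
        "INPUT_TEXT": emotionml,    # the EmotionML document to synthesize
        "INPUT_TYPE": "EMOTIONML",  # same input type as selected in the client
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
        "VOICE": voice,
    }
    return MARY_URL + "?" + urlencode(params)

emotionml = (
    '<emotionml xmlns="http://www.w3.org/2009/10/emotionml" '
    'category-set="http://www.w3.org/TR/emotion-voc/xml#big6">'
    '<emotion><category name="happiness"/>To be, or not to be.</emotion>'
    '</emotionml>'
)
print(build_mary_request(emotionml))
```

With the marytts-server running, the resulting URL can be fetched with urllib.request.urlopen and the returned bytes written to a .wav file.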
Mary is fully open source: all code, including the German TTS, is available under the LGPL, and the voices are distributed under Creative Commons or BSD licenses.
In case of problems or questions, contact workshop organizer Felix Burkhardt.
Workshop agenda
- 5 min: Intro to emotional speech synthesis
- 10 min: Intro to EmotionML
- 20 min: Time for the task
- 10 min: Final discussion / wrap-up
We suggest that (parts of) this excerpt from Alice in Wonderland be enriched with emotional TTS.
Very soon the Rabbit noticed Alice, and called out to her in an angry tone,
`Why, Mary Ann, what are you doing out here? Run home this moment, and fetch me a pair of gloves and a fan! Quick, now!'
`He took me for his housemaid,'
she said to herself as she ran.
`How surprised he'll be when he finds out who I am! But I'd better take him his fan and gloves.'
`How queer it seems,'
Alice said to herself,
`to be going messages for a rabbit! I suppose Dinah'll be sending me on messages next!'
And she began fancying the sort of thing that would happen:
`"Miss Alice! Come here directly, and get ready for your walk!"
"Coming in a minute, nurse! But I've got to see that the mouse doesn't get out."
Only I don't think,'
Alice went on,
`that they'd let Dinah stop in the house if it began ordering people about like that!'
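As a starting point for the task, the Rabbit's angry opening line and Alice's reaction might be marked up as in the following sketch. The category names come from the W3C "big6" vocabulary; how strongly each emotion is rendered depends on the voice you select.

```xml
<emotionml version="1.0"
           xmlns="http://www.w3.org/2009/10/emotionml"
           category-set="http://www.w3.org/TR/emotion-voc/xml#big6">
  <emotion>
    <category name="anger"/>
    Why, Mary Ann, what are you doing out here?
    Run home this moment, and fetch me a pair of gloves and a fan! Quick, now!
  </emotion>
  <emotion>
    <category name="surprise"/>
    He took me for his housemaid!
  </emotion>
</emotionml>
```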
The organizers
Deborah Dahl
Deborah Dahl has over 20 years of experience in speech and natural language technologies, including work on research, defense, and commercial systems. She is also active in speech and multimodal standards activities at the World Wide Web Consortium, serving as Chair of the Multimodal Interaction Working Group and Co-Chair of the Hypertext Coordination Group. She is an editor of the EMMA (Extensible MultiModal Annotation) specification.
Dr. Dahl received the prestigious "Speech Luminary" award from Speech Technology Magazine in 2012.
Felix Burkhardt
Felix Burkhardt does tutoring, consulting, research, and development in the fields of human-machine dialog systems, text-to-speech synthesis, speaker classification, ontology-based natural language modeling, voice search, and emotional human-machine interfaces.
Originally an expert in speech synthesis at the Technical University of Berlin, he wrote his Ph.D. thesis on the simulation of emotional speech by machines, recorded the Berlin database of acted emotions, "EmoDB", and maintains several open-source projects, including the emotional speech synthesizer "Emofilt" and the speech labeling and annotation tool "Speechalyzer". He has worked for Deutsche Telekom AG since 2000, currently at the Telekom Innovation Laboratories in Berlin. He was a member of the European Network of Excellence HUMAINE on emotion-oriented computing and is currently the editor of the W3C Emotion Markup Language specification.