Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations, such as phonetic transcriptions, into speech.
Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.
The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood clearly. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written words on a home computer. Many computer operating systems have included speech synthesizers since the early 1990s.
A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols such as numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, such as phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations), which is then imposed on the output speech.
History
Long before the invention of electronic signal processing, some people tried to build machines to emulate human speech. Some early legends of the existence of "Brazen Heads" involved Pope Silvester II (d. 1003 AD), Albertus Magnus (1198-1280), and Roger Bacon (1214-1294).
In 1779 the German-Danish scientist Christian Gottlieb Kratzenstein won the first prize in a competition announced by the Russian Imperial Academy of Sciences and Arts for models he built of the human vocal tract that could produce the five long vowel sounds (in International Phonetic Alphabet notation: [aː], [eː], [iː], [oː] and [uː]). There followed the bellows-operated "acoustic-mechanical speech machine" of Wolfgang von Kempelen of Pressburg, Hungary, described in a 1791 paper. This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837 Charles Wheatstone produced a "speaking machine" based on von Kempelen's design, and in 1846 Joseph Faber exhibited the "Euphonia". In 1923 Paget resurrected Wheatstone's design.
In the 1930s, Bell Labs developed the vocoder, which automatically analyzed speech into its fundamental tones and resonances. From his work on the vocoder, Homer Dudley developed a keyboard-operated voice synthesizer called The Voder (Voice Demonstrator), which he exhibited at the 1939 New York World's Fair.
Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories built the Pattern playback in the late 1940s and completed it in 1950. There were several different versions of this hardware device; only one currently survives. The machine converts pictures of the acoustic patterns of speech, in the form of a spectrogram, back into sound. Using this device, Alvin Liberman and colleagues discovered acoustic cues for the perception of phonetic segments (consonants and vowels).
In 1975 MUSA was released; it was one of the first speech synthesis systems. It consisted of stand-alone computer hardware and specialized software that enabled it to read Italian. A second version, released in 1978, was also able to sing Italian in an "a cappella" style.
The dominant systems of the 1980s and 1990s were the DECtalk system, based largely on the work of Dennis Klatt at MIT, and the Bell Labs system; the latter was one of the first multilingual, language-independent systems, making extensive use of natural language processing methods.
Early electronic speech synthesizers sounded robotic and were often barely intelligible. The quality of synthesized speech has steadily improved, but as of 2016 the output of contemporary speech synthesis systems remains clearly distinguishable from actual human speech.
Kurzweil predicted in 2005 that, as the cost-performance ratio caused speech synthesizers to become cheaper and more accessible, more people would benefit from the use of text-to-speech programs.
Electronic devices
The first computer-based speech synthesis systems originated in the late 1950s. Noriko Umeda et al. developed the first general English text-to-speech system in 1968 at the Electrotechnical Laboratory, Japan. In 1961 physicist John Larry Kelly, Jr., and his colleague Louis Gerstman used an IBM 704 computer to synthesize speech, an event among the most prominent in the history of Bell Labs. Kelly's voice recorder synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment from Max Mathews. Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey, where the HAL 9000 computer sings the same song as astronaut Dave Bowman puts it to sleep. Despite the success of purely electronic speech synthesis, research into mechanical speech synthesizers continues.
Handheld electronics featuring speech synthesis began emerging in the 1970s. One of the first was the Telesensory Systems Inc. (TSI) Speech+ portable calculator for the blind in 1976. Other devices had primarily educational purposes, such as the Speak & Spell toy produced by Texas Instruments in 1978. Fidelity released a speaking version of its electronic chess computer in 1979. The first video game to feature speech synthesis was the 1980 shoot 'em up arcade game Stratovox (known in Japan as Speak & Rescue), from Sun Electronics. The first personal computer game with speech synthesis was Manbiki Shoujo (Shoplifting Girl), released in 1980 for the PET 2001, for which the game's developer, Hiroshi Suzuki, developed a "zero cross" programming technique to produce a synthesized speech waveform. Another early example, the arcade version of Berzerk, also dates from 1980. The Milton Bradley Company produced the first multi-player electronic game using voice synthesis, Milton, in the same year.
Synthesizer technologies
The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible. Speech synthesis systems usually try to maximize both characteristics.
The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant synthesis. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used.
Concatenative synthesis
Concatenative synthesis is based on the concatenation (stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.
Unit selection synthesis
Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode, with some manual correction afterwards, using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and on acoustic parameters such as the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
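The search for the best chain of candidate units can be illustrated with a small sketch. It does not use the weighted decision tree mentioned above; instead it frames the problem as a dynamic-programming (Viterbi-style) search that minimizes a target cost plus a join cost, which is another common way to describe unit selection. The `Unit` fields, the two cost functions, and their weights are assumptions made for illustration and are not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str          # label of the recorded segment
    pitch_hz: float     # mean fundamental frequency of the recording
    duration_ms: float  # length of the recording


def target_cost(unit, want_pitch, want_dur):
    # Distance between the recorded unit and the requested prosody
    # (hypothetical weights).
    return abs(unit.pitch_hz - want_pitch) / 10.0 + abs(unit.duration_ms - want_dur) / 20.0


def join_cost(left, right):
    # Penalise pitch jumps at the concatenation point (hypothetical weight).
    return abs(left.pitch_hz - right.pitch_hz) / 10.0


def select_units(targets, database):
    """targets: list of (phone, pitch_hz, duration_ms); database: list of Unit.

    Returns one unit per target so that the summed target and join costs
    are minimal.
    """
    if not targets:
        return []
    candidates = [[u for u in database if u.phone == phone] for phone, _, _ in targets]
    if any(not c for c in candidates):
        raise ValueError("no recorded unit for at least one target phone")

    # best[i][k] = (cheapest total cost ending in candidate k of target i, back-pointer)
    _, pitch, dur = targets[0]
    best = [[(target_cost(u, pitch, dur), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        _, pitch, dur = targets[i]
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, pitch, dur), back))
        best.append(row)

    # Trace the cheapest path backwards through the back-pointers.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(candidates[i][k])
        k = best[i][k][1]
    chain.reverse()
    return chain
```

A real system would score many more features (spectral continuity, stress, position in the phrase) and would prune the candidate lists, but the trade-off between matching the target and joining smoothly is the same.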
Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into gigabytes of recorded data, representing dozens of hours of speech. Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database. Recently, researchers have proposed various automated methods to detect unnatural segments in unit-selection speech synthesis systems.
Diphone synthesis
Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA, or more recent techniques such as pitch modification in the source domain using the discrete cosine transform. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.
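As a small illustration of the unit inventory described above, the following sketch turns a phone sequence into the names of the diphones that would be looked up in the database. The `_` silence symbol and the `a-b` naming scheme are assumptions made for this example.

```python
def to_diphones(phones, silence="_"):
    """Return the diphone names for a list of phones, padded with silence."""
    padded = [silence, *phones, silence]
    # Each diphone spans the transition from one phone to the next.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]
```

For example, `to_diphones(["h", "e", "l", "o"])` yields five units: `_-h`, `h-e`, `e-l`, `l-o` and `o-_`. An utterance of N phones always needs N + 1 diphones under this padding convention.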
Domain-specific synthesis
Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, such as transit schedule announcements or weather reports. The technology is very simple to implement, and has been in commercial use for a long time, in devices such as talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.
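A talking clock is a convenient way to see how little logic such a system needs. The sketch below only decides which prerecorded clips to play and in what order; the clip names (`it_is`, `num_3`, `oh`, `oclock`, `am`, `pm`) are hypothetical, and it assumes recordings exist for the numbers 1 to 20 and for 30, 40 and 50.

```python
def clock_clips(hour, minute):
    """Return the names of prerecorded clips to play for a 24-hour time."""
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError("invalid time")
    hour12 = hour % 12 or 12          # 0 and 12 are both spoken as "twelve"
    clips = ["it_is", f"num_{hour12}"]
    if minute == 0:
        clips.append("oclock")
    elif minute < 10:
        clips += ["oh", f"num_{minute}"]          # e.g. "three oh five"
    elif minute <= 20 or minute % 10 == 0:
        clips.append(f"num_{minute}")             # a single recording exists
    else:
        clips += [f"num_{minute - minute % 10}", f"num_{minute % 10}"]
    clips.append("am" if hour < 12 else "pm")
    return clips
```

Calling `clock_clips(15, 5)` selects the clips for "it is three oh five pm". Because every clip is a recording of a real speaker, the result sounds natural as long as the clips were recorded with matching intonation.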
Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language can still cause problems, however, unless the many variations are taken into account. For example, in non-rhotic dialects of English the "r" in words like "clear" is usually pronounced only when the following word begins with a vowel (as in "clear out"). Likewise in French, many final consonants are no longer silent if followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.
Formant synthesis
Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using additive synthesis and an acoustic model (physical modelling synthesis). Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rule-based synthesis; however, many concatenative systems also have rule-based components. Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by people with visual impairments to quickly navigate computers using a screen reader. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are especially limited. Because formant-based systems have complete control over all aspects of the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.
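To make the idea of synthesizing speech from parameters rather than recordings concrete, here is a minimal source-filter sketch: an impulse train at the fundamental frequency is passed through a cascade of two-pole resonators, one per formant. This is one textbook way to build a formant synthesizer and is not claimed to be how any system named in this article works. The function names, the default values, and the example formant frequencies and bandwidths (roughly an "ah"-like vowel) are assumptions chosen for illustration.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter `signal` through one two-pole resonator (a single formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    c = -r * r
    a = 1.0 - b - c                   # normalises the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2   # y[n] = a*x[n] + b*y[n-1] + c*y[n-2]
        out.append(y)
        y1, y2 = y, y1
    return out


def synthesize_vowel(formants, f0_hz=120.0, duration_s=0.5, sample_rate=16000):
    """formants: list of (centre frequency in Hz, bandwidth in Hz) pairs."""
    n = round(duration_s * sample_rate)
    period = max(1, round(sample_rate / f0_hz))
    # Voiced source: one impulse per pitch period.
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    # Scale so the loudest sample has magnitude 1.
    peak = max((abs(s) for s in signal), default=0.0) or 1.0
    return [s / peak for s in signal]
```

A call such as `synthesize_vowel([(730, 90), (1090, 110), (2440, 170)])` returns a list of samples in the range -1 to 1. Changing `f0_hz` over time would alter the intonation, and changing the formant list would alter the vowel, which is the kind of full parametric control the paragraph above describes.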
Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy Speak & Spell, and in the early 1980s for Sega arcade machines and for many Atari, Inc. arcade games using the TMS5220 LPC chips. Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.
Formant synthesis was implemented in hardware in the Yamaha FS1R synthesizer, but the speech aspect of formants was never realized in the synth. It was capable of short formant sequences a few seconds long that could speak a single phrase, but because its MIDI control interface was so restrictive, live speech was impossible.
Articulatory synthesis
Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.
Until recently, articulatory synthesis models had not been incorporated into commercial speech synthesis systems. A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model".
More recent synthesizers, developed by Jorge C. Lucero and colleagues, incorporate models of vocal fold biomechanics, glottal aerodynamics and acoustic wave propagation in the bronchi, trachea, nasal and oral cavities, and thus constitute full systems of physics-based speech simulation.
HMM-based synthesis
HMM-based synthesis is a synthesis method based on hidden Markov models, also called statistical parametric synthesis. In this system, the frequency spectrum (vocal tract), fundamental frequency (voice source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion.
Sinewave synthesis
Sinewave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with pure tone whistles.
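The replacement of formants by pure tones can be sketched in a few lines. The function below sums one sine wave per formant, following a frame-by-frame track of frequencies and amplitudes. The frame layout, the 10 ms default frame length, and the function name are assumptions for this example; in practice the tracks would come from formant analysis of a recorded utterance.

```python
import math


def sinewave_speech(tracks, frame_s=0.01, sample_rate=16000):
    """tracks: list of frames; each frame lists (freq_hz, amplitude) per formant."""
    frame_len = round(frame_s * sample_rate)
    # One running phase per tone keeps each sinusoid continuous across frames.
    phases = [0.0] * max((len(frame) for frame in tracks), default=0)
    samples = []
    for frame in tracks:
        for _ in range(frame_len):
            value = 0.0
            for k, (freq_hz, amplitude) in enumerate(frame):
                phases[k] += 2.0 * math.pi * freq_hz / sample_rate
                value += amplitude * math.sin(phases[k])
            samples.append(value)
    return samples
```

Accumulating the phase, instead of recomputing it from the sample index, avoids clicks when a tone's frequency changes from one frame to the next.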
Challenges
Text normalization challenges
The process of normalizing text is rarely straightforward. Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation. There are many spellings in English that are pronounced differently depending on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project".
Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, because processes for doing so are unreliable, poorly understood, and computationally ineffective. As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, such as examining neighboring words and using statistics about frequency of occurrence.
Recently, TTS systems have begun to use HMMs (discussed above) to generate "parts of speech" to aid in disambiguating homographs. This technique is quite successful for many cases, such as whether "read" should be pronounced as "red", implying past tense, or as "reed", implying present tense. Typical error rates when using HMMs in this fashion are usually below five percent. These techniques also work well for most European languages, although access to the required training corpora is frequently difficult in these languages.
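The neighboring-word heuristic can be shown with a toy example for the homograph "project". The cue-word lists, the informal respellings, and the default reading are all assumptions made for illustration; a practical system would use far larger lists or statistics gathered from a corpus.

```python
import re

# Hypothetical cue words: the word just before the homograph hints at its role.
CUES = {
    "to": "verb", "will": "verb", "can": "verb", "should": "verb",
    "a": "noun", "the": "noun", "my": "noun", "this": "noun", "latest": "noun",
}

# Informal respellings, for illustration only.
PRONUNCIATIONS = {
    "project": {"noun": "PROJ-ekt", "verb": "pruh-JEKT"},
}


def disambiguate(sentence, default="noun"):
    """Return (word, guessed part of speech, pronunciation) for each known homograph."""
    words = re.findall(r"[a-z']+", sentence.lower())
    results = []
    for i, word in enumerate(words):
        if word in PRONUNCIATIONS:
            previous = words[i - 1] if i > 0 else ""
            role = CUES.get(previous, default)   # fall back to the assumed common reading
            results.append((word, role, PRONUNCIATIONS[word][role]))
    return results
```

For "My latest project is to learn how to project my voice." the first occurrence follows "latest" and is read as a noun, while the second follows "to" and is read as a verb. The heuristic fails as soon as the cue word is missing or misleading, which is why statistical methods are attractive.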
Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words (at least in English), such as "1325" becoming "one thousand three hundred and twenty-five." However, numbers occur in many different contexts; "1325" may also be read as "one three two five", "thirteen twenty-five" or "thirteen hundred and twenty-five". A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous. Roman numerals can also be read differently depending on context. For example, "Henry VIII" reads as "Henry the Eighth", while "Chapter VIII" reads as "Chapter Eight".
Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St. John St." uses the same abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others give the same result in all cases, producing nonsensical (and sometimes comical) output, such as "co-operation" being rendered as "company operation".
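The different readings of "1325" can be produced by a short routine once the context has been decided. In the sketch below the caller chooses the reading through a `style` argument; the style names and the restriction to numbers up to 9999 are assumptions made to keep the example small, and deciding which style applies is the hard part that this code does not attempt.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n):
    """Spell out 0..9999 as a quantity."""
    if not 0 <= n <= 9999:
        raise ValueError("only 0..9999 is supported in this sketch")
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        word = two_digits(rest)
        parts.append("and " + word if parts else word)
    return " ".join(parts)


def expand_number(text, style="cardinal"):
    """style: 'cardinal', 'digits' (e.g. a PIN) or 'pairs' (e.g. a year)."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError("expected a string of ASCII digits")
    if style == "digits":
        return " ".join(ONES[int(d)] for d in text)
    if style == "pairs" and len(text) == 4:
        high, low = int(text[:2]), int(text[2:])
        if low == 0:
            tail = "hundred"
        elif low < 10:
            tail = "oh " + ONES[low]
        else:
            tail = two_digits(low)
        return two_digits(high) + " " + tail
    return cardinal(int(text))
```

With these definitions, `expand_number("1325")` gives the quantity reading, `expand_number("1325", "digits")` reads the digits one by one, and `expand_number("1325", "pairs")` gives the year-like reading.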
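An "educated guess" for the "St." example can be as simple as looking at what follows the abbreviation. The rule used here (a capitalized word after "St." means "Saint", anything else means "Street") is an assumption chosen for illustration and is easy to fool.

```python
import re

_ST = re.compile(r"\bSt\.")


def expand_st(text):
    """Expand 'St.' to 'Saint' before a capitalised word, otherwise to 'Street'."""
    def choose(match):
        following = text[match.end():]
        return "Saint" if re.match(r"\s+[A-Z]", following) else "Street"
    return _ST.sub(choose, text)
```

Applied to "12 St. John St." this produces "12 Saint John Street". It would wrongly produce "Saint" for "Main St. Near the park", where the capital letter merely starts a new sentence, which shows why abbreviation expansion needs more context than a single rule.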
Text-to-phoneme challenges
Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds in a language). The simplest approach to text-to-phoneme conversion is the dictionary-based approach, in which a large dictionary containing all the words of a language and their correct pronunciations is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings. This is similar to the "sounding out", or synthetic phonics, approach to learning to read.
Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but fails completely if it is given a word that is not in its dictionary. As dictionary size grows, so do the memory space requirements of the synthesis system. On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations. (Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems use a combination of these approaches.
Languages with a phonemic orthography have a very regular writing system, and the prediction of the pronunciation of words based on their spellings is quite successful. Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, such as foreign names and borrowings, whose pronunciations are not obvious from their spellings. On the other hand, speech synthesis systems for languages like English, which have extremely irregular spelling systems, are more likely to rely on dictionaries, and to use rule-based methods only for unusual words, or words that are not in their dictionaries.
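The combination of the two approaches can be sketched as a dictionary lookup with a rule-based fallback. The three-word lexicon, the ARPAbet-like phoneme symbols, and the naive letter rules below are all assumptions for illustration; real letter-to-sound rules are context-dependent and far more numerous.

```python
# Tiny hypothetical exception dictionary (ARPAbet-like symbols).
LEXICON = {
    "of": ["AH", "V"],
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
}

# Naive letter-to-sound rules; two-letter rules are tried before one-letter rules.
RULES = {
    "sh": ["SH"], "ch": ["CH"], "th": ["TH"],
    "a": ["AE"], "b": ["B"], "d": ["D"], "e": ["EH"], "f": ["F"], "g": ["G"],
    "i": ["IH"], "k": ["K"], "l": ["L"], "m": ["M"], "n": ["N"], "o": ["AA"],
    "p": ["P"], "r": ["R"], "s": ["S"], "t": ["T"], "u": ["AH"], "v": ["V"],
}


def letter_to_sound(word):
    """Rule-based fallback: greedy longest-match over the spelling."""
    phonemes, i = [], 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in RULES:
                phonemes += RULES[chunk]
                i += len(chunk)
                break
        else:
            i += 1  # skip letters this toy rule set does not cover
    return phonemes


def pronounce(word):
    """Dictionary first, rules only for words the dictionary lacks."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    return letter_to_sound(word)
```

Here `pronounce("of")` returns the dictionary entry ending in `V`, whereas the rules alone would give `AA F`; a regular word such as "fish" is absent from the dictionary and is handled by the rules.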
Evaluation challenges
The consistent evaluation of speech synthesis systems may be difficult because of a lack of universally agreed objective evaluation criteria. Different organizations often use different speech data. The quality of speech synthesis systems also depends on the quality of the production technique (which may involve analogue or digital recording) and on the facilities used to replay the speech. Evaluating speech synthesis systems has therefore often been compromised by differences between production techniques and replay facilities.
Since 2005, however, some researchers have started to evaluate speech synthesis systems using a common speech dataset.
Prosodic and emotional content
A study in the journal Speech Communication by Amy Drahota and colleagues at the University of Portsmouth, England, reported that listeners to voice recordings could determine, at better than chance levels, whether or not the speaker was smiling. It was suggested that identification of the vocal features that signal emotional content may be used to help make synthesized speech sound more natural. One of the related issues is modification of the pitch contour of the sentence, depending on whether it is an affirmative, interrogative or exclamatory sentence. One of the techniques for pitch modification uses the discrete cosine transform in the source domain (linear prediction residual). Such pitch-synchronous pitch modification techniques need a priori pitch marking of the synthesis speech database, using techniques such as epoch extraction with a dynamic plosion index applied to the integrated linear prediction residual of the voiced regions of speech.
Custom hardware
Early technology (no longer available)
- Icophone
- Votrax
- SC-01A (analog formant)
- SC-02/SSI-263/"Artic 263"
- General Instrument SP0256-AL2 (CTS256A-AL2)
- National Semiconductor DT1050 Digitalker (Mozer - Forrest Mozer)
- Silicon Systems SSI 263 (analog formant)
- Texas Instruments LPC Speech Chips
- TMS5110A
- TMS5200
- MSP50C6XX - Sold to Sensory, Inc. in 2001
- Hitachi HD38880BP (Vanguard Arcade Game SNK 1981)
Current (as of 2013)
- Magnevation SpeakJet (www.speechchips.com) TTS256. For hobbyists and experimenters.
- Epson S1V30120F01A100 (www.epson.com) IC. DECtalk-based voice, robotic, English/Spanish.
- Textspeak TTS-EM (www.textspeak.com) ICs, modules and industrial enclosures in 24 languages. Human-sounding, phoneme-based.
Hardware and software systems
Popular systems offering speech synthesis as a built-in capability.
Mattel
The Mattel Intellivision game console offered the Intellivoice Voice Synthesis module in 1982. It included the SP0256 Narrator speech synthesizer chip on a removable cartridge. The Narrator had 2 kB of Read-Only Memory (ROM), which was used to store a database of generic words that could be combined to make phrases in Intellivision games. Because the Orator chip could also accept speech data from external memory, any additional words or phrases needed could be stored inside the cartridge itself. The data consisted of strings of analog-filter coefficients to modify the behavior of the chip's synthetic vocal-tract model, rather than simple digitized samples.
SAM
Also released in 1982, Software Automatic Mouth was the first commercial all-software voice synthesis program. It was later used as the basis for MacinTalk. The program was available for non-Macintosh Apple computers (including the Apple II and the Lisa), various Atari models and the Commodore 64. The Apple version preferred additional hardware that contained DACs, although it could instead use the computer's one-bit audio output (with the addition of much distortion) if the card was not present. The Atari made use of the embedded POKEY audio chip. Speech playback on the Atari normally disabled interrupt requests and shut down the ANTIC chip during vocal output. The audible output is extremely distorted speech when the screen is on. The Commodore 64 made use of its embedded SID audio chip.
Atari
Arguably, the first speech system integrated into an operating system was that of the 1400XL/1450XL personal computers designed by Atari, Inc. using the Votrax SC01 chip in 1983. The 1400XL/1450XL computers used a Finite State Machine to enable World English Spelling text-to-speech synthesis. Unfortunately, the 1400XL/1450XL personal computers never shipped in quantity.
The Atari ST computers were sold with "stspeech.tos" on floppy disk.
Apple
The first speech system integrated into an operating system that shipped in quantity was Apple Computer's MacInTalk. The software was licensed from third-party developers Joseph Katz and Mark Barton (later, SoftVoice, Inc.) and was featured during the 1984 introduction of the Macintosh computer. This January demo required 512 kilobytes of RAM. As a result, it could not run in the 128 kilobytes of RAM the first Mac actually shipped with. So, the demo was accomplished with a prototype 512k Mac, although those in attendance were not told of this, and the synthesis demo created considerable excitement for the Macintosh. In the early 1990s Apple expanded its capabilities, offering system-wide text-to-speech support. With the introduction of faster PowerPC-based computers, it included higher-quality voice sampling. Apple also introduced speech recognition into its systems, which provided a fluid command set. More recently, Apple has added sample-based voices. Starting as a curiosity, the speech system of the Apple Macintosh has evolved into a fully supported program, PlainTalk, for people with vision problems. VoiceOver was first featured in Mac OS X Tiger (10.4). During 10.4 (Tiger) and the first releases of 10.5 (Leopard) there was only one standard voice shipping with Mac OS X. Starting with 10.6 (Snow Leopard), the user can choose from a list of multiple voices. VoiceOver voices feature the taking of realistic-sounding breaths between sentences, as well as improved clarity at high read rates over PlainTalk. Mac OS X also includes say, a command-line application that converts text to audible speech. The AppleScript Standard Additions include a say verb that allows a script to use any of the installed voices and to control the pitch, speaking rate and modulation of the spoken text.
The Apple iOS operating system used on the iPhone, iPad and iPod Touch uses VoiceOver speech synthesis for accessibility. Some third-party applications also provide speech synthesis to facilitate navigating, reading web pages or translating text.
AmigaOS
The second operating system to feature advanced speech synthesis capabilities was AmigaOS, introduced in 1985. The voice synthesis was licensed by Commodore International from SoftVoice, Inc., which also developed the original MacinTalk text-to-speech system. It featured a complete system of voice emulation for American English, with both male and female voices and "stress" indicator markers, made possible through the Amiga's audio chipset. The synthesis system was divided into a translator library, which converted unrestricted English text into a standard set of phonetic codes, and a narrator device, which implemented a formant model of speech generation. AmigaOS also featured a high-level "Speak Handler", which allowed command-line users to redirect text output to speech. Speech synthesis was occasionally used in third-party programs, particularly word processors and educational software. The synthesis software remained largely unchanged from the first AmigaOS release, and Commodore eventually removed speech synthesis support from AmigaOS 2.1 onward.
Despite the American English phoneme limitation, an unofficial version with multilingual speech synthesis was developed. It made use of an enhanced version of the translator library that could translate a number of languages, given a set of rules for each language.
Microsoft Windows
Modern Windows desktop systems can use SAPI 4 and SAPI 5 components to support speech synthesis and speech recognition. SAPI 4.0 was available as an optional add-on for Windows 95 and Windows 98. Windows 2000 added Narrator, a text-to-speech utility for people with visual impairments. Third-party programs such as JAWS for Windows, Window-Eyes, Non-visual Desktop Access, Supernova and System Access can perform various text-to-speech tasks such as reading text aloud from a specified website, email account, text document, the Windows clipboard, the user's keyboard typing, etc. Not all programs can use speech synthesis directly. Some programs can use plug-ins, extensions or add-ons to read text aloud. Third-party programs are available that can read text from the system clipboard.
Microsoft Speech Server is a server-based package for voice synthesis and recognition. It is designed for network use with web applications and call centers.
Texas Instruments TI-99/4A
In the early 1980s, TI was known as a pioneer in speech synthesis, and a highly popular plug-in speech synthesizer module was available for the TI-99/4 and 4A. Speech synthesizers were offered free with the purchase of a number of cartridges and were used by many TI-written video games (notable titles offered with speech during this promotion were Alpiner and Parsec). The synthesizer uses a variant of linear predictive coding and has a small built-in vocabulary. The original intent was to release small cartridges that plugged directly into the synthesizer unit, which would increase the device's built-in vocabulary. However, the success of software text-to-speech in the Terminal Emulator II cartridge cancelled that plan.
Text-to-speech systems
Text-to-Speech (TTS) refers to the ability of computers to read text aloud. A TTS engine converts written text to a phonemic representation, then converts the phonemic representation to waveforms that can be output as sound. TTS engines with different languages, dialects and specialized vocabularies are available through third-party publishers.
Android
Version 1.6 of Android added support for speech synthesis (TTS).
Internet
Currently, there are a number of applications, plugins and gadgets that can read messages directly from an e-mail client and web pages from a web browser or Google Toolbar, such as Text to Voice, which is an add-on to Firefox. Some specialized software can narrate RSS feeds. On the one hand, online RSS narrators simplify information delivery by allowing users to listen to their favourite news sources and to convert them to podcasts. On the other hand, online RSS readers are available on almost any PC connected to the Internet. Users can download the generated audio files to portable devices, e.g. with the help of a podcast receiver, and listen to them while walking, jogging or commuting to work.
A growing field in Internet-based TTS is web-based assistive technology, e.g. 'Browsealoud' from a UK company and Readspeaker. It can deliver TTS functionality to anyone (for reasons of accessibility, convenience, entertainment or information) with access to a web browser. The non-profit project Pediaphon was created in 2006 to provide a similar web-based TTS interface to Wikipedia.
Other work is underway in the context of the W3C through the W3C Audio Incubator Group with the involvement of the BBC and Google Inc.
Open source
Systems that run on free and open source software, including Linux, are varied. They include open source programs such as the Festival Speech Synthesis System, which uses diphone-based synthesis as well as more modern and better-sounding techniques; eSpeak, which supports a broad range of languages; and gnuspeech, from the Free Software Foundation, which uses articulatory synthesis.
Others
- Following the commercial failure of the hardware-based Intellivoice, game developers used software synthesis only sparingly in later games. A notable example is the introductory narration of Nintendo's Super Metroid for the Super Nintendo Entertainment System. Earlier systems from Atari, such as the Atari 5200 (Baseball) and the Atari 2600 (Quadrun and Open Sesame), also had games that used software synthesis.
- Some e-book readers, such as Amazon Kindle, Samsung E6, PocketBook eReader Pro, enTourage eDGe, and Bebook Neo.
- The BBC Micro incorporated the Texas Instruments TMS5220 speech synthesis chip.
- Some Texas Instruments home computer models produced in 1979 and 1981 (the TI-99/4 and TI-99/4A) were capable of text-to-phoneme synthesis or of reciting complete words and phrases (text-to-dictionary), using the highly popular Speech Synthesizer peripheral. TI used a proprietary codec to embed complete spoken phrases into applications, primarily video games.
- IBM OS/2 Warp 4 includes VoiceType, the precursor to IBM ViaVoice.
- GPS Navigation Units produced by Garmin, Magellan, TomTom, and others use speech synthesis for car navigation.
- Yamaha produced a music synthesizer in 1999, the Yamaha FS1R, which included a formant synthesis capability. Sequences of up to 512 individual vowel and consonant formants could be stored and replayed, allowing short vocal phrases to be synthesized.
- Taiwanese Speech Notepad is a corpus-based concatenative text-to-speech system for Taiwanese on Microsoft Windows XP/Win7. The software has three major components: a Taiwanese tone type parser, a speech engine, and a speech synthesizer. The system is installed directly on the PC and operates independently, without connecting to the MS Speech SDK or the IBM TTS Engine. The graphical user interface includes functions such as Taiwanese or traditional Chinese input, a synchronized sound dictionary, Chinese/English to Taiwanese word index search, speech output to external browsers and programs, and book making for readers with reading disabilities.
Digital sound-alikes
With the 2016 introduction of Adobe Voco, a prototype audio editing and generation program slated to become part of Adobe Creative Suite, and of the similarly capable DeepMind WaveNet, deep neural network audio synthesis software from Google, synthesized speech is becoming very difficult to distinguish from a real human voice.
Adobe Voco takes about 20 minutes of the desired target's speech, after which it can generate a sound-alike voice, even for phonemes that were not present in the training material. The software raises obvious ethical concerns, because it makes it possible to steal other people's voices and manipulate them to say anything desired.
This adds to the pressure on the disinformation situation, coupled with the facts that:
- Human image synthesis has, since the early 2000s, improved beyond the point at which people can tell a real human imaged with a real camera from a simulated human imaged with a simulated camera.
- 2D video forgery techniques were presented in 2016 that allow real-time falsification of faces in existing 2D video.
Speech synthesis markup languages
A number of markup languages have been defined for rendering text as speech in an XML-compliant format. The most recent is Speech Synthesis Markup Language (SSML), which became a W3C recommendation in 2004. Older speech synthesis markup languages include Java Speech Markup Language (JSML) and SABLE. Although each of these was proposed as a standard, none of them has been widely adopted.
Speech synthesis markup languages are distinct from dialogue markup languages. VoiceXML, for example, includes tags related to speech recognition, dialogue management and touch-tone dialing, in addition to text-to-speech markup.
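As a rough illustration of what SSML markup looks like, the following sketch builds a small document with Python's standard `xml.etree.ElementTree` module. The `speak`, `break` and `prosody` elements come from SSML 1.0; the helper name `build_ssml`, its parameters and the sample sentences are assumptions made for this example, and whether a given engine honors each element depends on that engine.

```python
import xml.etree.ElementTree as ET


def build_ssml(first, second, pause_ms=500, rate="slow"):
    """Return an SSML string: `first`, a pause, then `second` at another rate."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    speak.text = first
    # <break> inserts a silence of the given duration.
    ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    # <prosody> changes how the enclosed text is spoken, here its rate.
    prosody = ET.SubElement(speak, "prosody", {"rate": rate})
    prosody.text = second
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Hello.", "Goodbye.")
```

The resulting string would then be passed to a synthesizer that accepts SSML input in place of plain text.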
Applications
Speech synthesis has long been a vital assistive technology tool, and its application in this area is significant and widespread. It allows environmental barriers to be removed for people with a wide range of disabilities. The longest-standing application has been the use of screen readers by people with visual impairments, but text-to-speech systems are now commonly used by people with dyslexia and other reading difficulties, as well as by pre-literate children. They are also frequently used to help people with severe speech impairments, usually through a dedicated voice output communication aid.
Speech synthesis techniques are also used in entertainment productions such as games and animation. In 2007, Animo Limited announced the development of a software application package based on its speech synthesis software FineSpeech, explicitly aimed at customers in the entertainment industry and able to generate narration and lines of dialogue according to user specifications. The application reached maturity in 2008, when NEC Biglobe announced a web service that allows users to create phrases from the voices of characters in Code Geass: Lelouch of the Rebellion R2.
In recent years, text-to-speech for disability and handicapped communication aids has become widely deployed in mass transit. Text-to-speech is also finding new applications outside the disability market. For example, speech synthesis, combined with speech recognition, allows interaction with mobile devices through natural language processing interfaces.
Text-to-speech is also used in second language acquisition. Voki, for example, is an educational tool created by Oddcast that allows users to create their own talking avatars, using different accents. These can be emailed, embedded on websites, or shared on social media.
In addition, speech synthesis is a valuable computational aid for the analysis and assessment of speech disorders. A voice quality synthesizer, developed by Jorge C. Lucero et al. at the University of Brasilia, simulates the physics of phonation and includes models of vocal frequency jitter and tremor, airflow noise and laryngeal asymmetries. The synthesizer has been used to mimic the timbre of dysphonic speakers with controlled levels of roughness, breathiness and strain.
API
Some companies offer TTS APIs to their customers to accelerate the development of new applications that use TTS technology. Companies offering TTS APIs include AT&T, CereProc, DIOTEK, IVONA, Neospeech, Readspeaker, SYNVO, YAKiToMe!, Yandex and CPqD. For mobile app development, the Android operating system has offered a text-to-speech API for a long time. More recently, with iOS 7, Apple started offering an API for text-to-speech.
Source of the article : Wikipedia