Acoustic model

An acoustic model is used in automatic speech recognition to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech. The model is learned from a set of audio recordings and their corresponding transcripts. It is created by taking audio recordings of speech, and their text transcriptions, and using software to create statistical representations of the sounds that make up each word.

Background

Modern speech recognition systems use both an acoustic model and a language model to represent the statistical properties of speech. The acoustic model models the relationship between the audio signal and the phonetic units in the language. The language model is responsible for modeling the word sequences in the language. These two models are combined to get the top-ranked word sequences corresponding to a given audio segment.

Most modern speech recognition systems operate on the audio in small chunks known as frames, with an approximate duration of 10 ms per frame. The raw audio signal from each frame can be transformed by applying the mel-frequency cepstrum. The coefficients from this transformation are commonly known as mel-frequency cepstral coefficients (MFCCs) and are used as an input to the acoustic model along with other features.

Recently, the use of convolutional neural networks has led to big improvements in acoustic modeling (T. Sainath et al., "Convolutional neural networks for LVCSR," ICASSP, 2013).

Speech audio characteristics

Audio can be encoded at different sampling rates (i.e. samples per second; the most common being 8, 16, 32, 44.1, 48, and 96 kHz) and different bits per sample (the most common being 8, 16, 24, or 32 bits). Speech recognition engines work best if the acoustic model they use was trained with speech audio recorded at the same sampling rate and bits per sample as the speech being recognized.

Telephony-based speech recognition

The limiting factor for telephony-based speech recognition is the bandwidth at which speech can be transmitted. For example, a standard land-line telephone only has a bandwidth of 64 kbit/s at a sampling rate of 8 kHz and 8 bits per sample (8000 samples per second × 8 bits per sample = 64000 bit/s). Therefore, for telephony-based speech recognition, acoustic models should be trained with 8 kHz/8-bit speech audio files.

In the case of Voice over IP, the codec determines the sampling rate/bits per sample of speech transmission. Codecs with a higher sampling rate/bits per sample for speech transmission (which improve the sound quality) necessitate acoustic models trained with audio data that matches that sampling rate/bits per sample.

Desktop-based speech recognition

For speech recognition on a standard desktop PC, the limiting factor is the sound card. Most sound cards today can record at sampling rates of between 16 kHz and 48 kHz, with bit depths of 8 to 16 bits per sample, and play back at up to 96 kHz.

As a general rule, a speech recognition engine works better with acoustic models trained with speech audio data recorded at higher sampling rates/bits per sample. But using audio with too high a sampling rate/bits per sample can slow the recognition engine down. A compromise is needed. Thus for desktop speech recognition, the current standard is acoustic models trained with speech audio data recorded at sampling rates of 16 kHz/16 bits per sample.
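
As a rough illustration of the framing step mentioned above (audio processed in frames of roughly 10 ms), the following Python sketch splits a sample sequence into consecutive frames. The function name `split_into_frames`, the default frame length, and the choice to drop a trailing partial frame are assumptions made for this example only; practical front ends commonly use overlapping, windowed frames before computing MFCCs.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split a sequence of samples into consecutive frames of frame_ms each.

    Hypothetical helper for illustration: frames do not overlap and a
    trailing partial frame is dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of placeholder audio at 16 kHz (all zeros, just to count frames).
one_second = [0] * 16000
frames = split_into_frames(one_second, sample_rate_hz=16000)
print(len(frames), len(frames[0]))  # prints: 100 160
```

At 16 kHz a 10 ms frame holds 160 samples, so one second of audio yields 100 frames; at the 8 kHz telephony rate the same frame duration holds 80 samples.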

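The bit-rate arithmetic quoted above for land-line telephony (8000 samples per second × 8 bits per sample) applies to any uncompressed audio format. The Python sketch below reproduces the 64 kbit/s figure and applies the same formula to the 16 kHz/16-bit desktop format. The function name `pcm_bit_rate` is hypothetical, and a single channel is assumed unless stated otherwise.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Return the raw (uncompressed) bit rate in bit/s."""
    return sample_rate_hz * bits_per_sample * channels


telephony = pcm_bit_rate(8000, 8)    # land-line telephony format
desktop = pcm_bit_rate(16000, 16)    # common desktop recognition format

print(telephony)  # 64000
print(desktop)    # 256000
```

The desktop format carries four times the raw data of the telephony format, which gives a sense of the trade-off between audio quality and processing cost described in the text.
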
Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.
