(i.e. samples per second; the most common being 8, 16, 32, 44.1, 48, and 96 kHz) and different bits per sample (the most common being 8, 16, 24, or 32 bits). Speech recognition engines work best if the acoustic model they use was trained on speech audio recorded at the same sampling rate and bits per sample as the speech being recognized.
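As an illustration of matching the audio format to the acoustic model, the following minimal sketch reads the sampling rate and bits per sample from a WAV header using Python's standard `wave` module and compares them with the format a model is assumed to expect. The function names `audio_format` and `matches_model` are hypothetical, and the 16 kHz/16-bit test file is generated in memory purely for the example.

```python
import io
import wave


def audio_format(wav_file):
    """Return (sample_rate_hz, bits_per_sample) read from a WAV header."""
    with wave.open(wav_file, "rb") as wav:
        # getsampwidth() is in bytes per sample, so convert to bits.
        return wav.getframerate(), wav.getsampwidth() * 8


def matches_model(wav_file, model_rate_hz, model_bits):
    """True if the recording has the format the model was trained on (hypothetical helper)."""
    return audio_format(wav_file) == (model_rate_hz, model_bits)


# Build a tiny in-memory 16 kHz / 16-bit mono WAV so the example is self-contained.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)                 # 2 bytes = 16 bits per sample
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 160)  # 10 ms of silence
buffer.seek(0)

print(matches_model(buffer, 16000, 16))
```

A mismatch would usually be handled by resampling the audio or by choosing a model trained on the matching format; the sketch only detects the mismatch.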
based speech recognition is the bandwidth at which speech can be transmitted. For example, a standard land-line telephone has a bandwidth of only 64 kbit/s, at a sampling rate of 8 kHz and 8 bits per sample (8000 samples per second × 8 bits per sample = 64,000 bit/s). Therefore, for telephony
or other linguistic units that make up speech. The model is learned from a set of audio recordings and their corresponding transcripts: software takes the recordings and their text transcriptions and creates statistical representations of the sounds that make
to represent the statistical properties of speech. The acoustic model captures the relationship between the audio signal and the phonetic units of the language, while the language model captures the word sequences of the language. The two models are combined to get the top-ranked word
As a general rule, a speech recognition engine works better with acoustic models trained on speech audio recorded at higher sampling rates and bits per sample. However, audio with too high a sampling rate or too many bits per sample can slow the recognition engine down, so a compromise is needed. Thus for
determines the sampling rate/bits per sample of speech transmission. Codecs with a higher sampling rate/bits per sample (which improve sound quality) require acoustic models trained on audio data that matches that sampling rate/bits per sample.
recognition systems operate on the audio in small chunks known as frames, each approximately 10 ms long. The raw audio signal from each frame can be transformed by applying the
. The coefficients from this transformation are commonly known as mel-frequency cepstral coefficients (MFCCs) and are used as input to the acoustic model, along with other features.
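The framing step can be sketched as follows. This is a minimal illustration that only splits a signal into consecutive 10 ms frames; it does not compute the cepstral coefficients themselves. The function name `split_into_frames` and the 440 Hz test tone are assumptions made for the example, and many practical systems use overlapping, windowed frames rather than the non-overlapping ones shown here.

```python
import math


def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split a list of samples into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


# 50 ms of a 440 Hz tone sampled at 16 kHz (800 samples).
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]

frames = split_into_frames(signal, sample_rate)
print(len(frames), len(frames[0]))  # 5 160
```

Each frame would then be passed to the feature-extraction stage, which produces one feature vector per frame.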
. Most sound cards today can record audio at sampling rates between 16 kHz and 48 kHz, with bit depths of 8 to 16 bits per sample, and play back at up to 96 kHz.
desktop speech recognition, the current standard is acoustic models trained on speech audio recorded at a sampling rate of 16 kHz with 16 bits per sample.
based speech recognition, acoustic models should be trained with 8 kHz/8-bit speech audio files.
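The bandwidth arithmetic for uncompressed audio can be checked with a short sketch. The helper name `bit_rate` is hypothetical; the figures are the telephony format (8 kHz, 8 bits per sample) and the desktop format (16 kHz, 16 bits per sample) discussed in this article.

```python
def bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed (PCM) bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


telephony = bit_rate(8000, 8)    # 8 kHz, 8 bits per sample
desktop = bit_rate(16000, 16)    # 16 kHz, 16 bits per sample

print(telephony)  # 64000
print(desktop)    # 256000
```

The desktop format therefore carries four times as much data per second as the telephony format.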
For speech recognition on a standard desktop PC, the limiting factor is the
has led to significant improvements in acoustic modeling.
sequences corresponding to a given audio segment.
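One common way to combine the two models is to add their log-probabilities for each candidate word sequence and pick the highest total. The sketch below is a toy illustration of that idea only: the candidate sentences and all probabilities are made up, and the function name `best_transcription` and the `lm_weight` parameter are assumptions, not part of any particular recognizer.

```python
import math


def best_transcription(candidates, lm_weight=1.0):
    """Pick the word sequence with the highest combined score.

    candidates: list of (words, acoustic_log_prob, language_log_prob).
    The combined score is acoustic_log_prob + lm_weight * language_log_prob.
    """
    return max(candidates, key=lambda c: c[1] + lm_weight * c[2])[0]


# Made-up scores: the second candidate fits the audio slightly better,
# but the language model considers it a far less likely word sequence.
candidates = [
    ("recognize speech", math.log(0.20), math.log(0.010)),
    ("wreck a nice beach", math.log(0.25), math.log(0.0001)),
]

print(best_transcription(candidates))  # recognize speech
```

With these numbers the language model outweighs the small acoustic advantage of the second candidate, so the first is ranked on top.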
., "Convolutional neural networks for LVCSR,"
systems use both an acoustic model and a
to represent the relationship between an
Telephony-based speech recognition
Desktop-based speech recognition
Convolutional Neural Networks
Speech audio characteristics
automatic speech recognition
open source acoustic models
Japanese acoustic models
The limiting factor for
HTK WSJ acoustic models
mel-frequency cepstrum
Recently, the use of
speech recognition
for use with
In the case of
External links
sampling rates
language model
up each word.
acoustic model
Voice over IP
at different
Audio can be
Most modern
audio signal
T. Sainath
is used in
References
sound card
Background
telephony
VoxForge
phonemes
and the
, 2013.
encoded
Modern
Julius
ICASSP
et al.
, the
speech
codec
for
HTK
at
An