Structure mining or structured data mining is the process of finding and extracting useful information from semi-structured data sets. Graph mining, sequential pattern mining and molecule mining are special cases of structured data mining.

Description

The growth of the use of semi-structured data has created new opportunities for data mining, which has traditionally been concerned with tabular data sets, reflecting the strong association between data mining and relational databases. Much of the world's interesting and mineable data does not easily fold into relational databases, though a generation of software engineers has been trained to believe this was the only way to handle data, and data mining algorithms have generally been developed only to cope with tabular data.

XML, being the most frequent way of representing semi-structured data, is able to represent both tabular data and arbitrary trees. Any particular representation of data to be exchanged between two applications in XML is normally described by a schema, often written in XSD. Practical examples of such schemata, for instance NewsML, are normally very sophisticated, containing multiple optional subtrees used for representing special-case data. Frequently around 90% of a schema is concerned with the definition of these optional data items and subtrees.

Messages and data, therefore, that are transmitted or encoded using XML and that conform to the same schema are liable to contain very different data depending on what is being transmitted.

Such data presents large problems for conventional data mining. Two messages that conform to the same schema may have little data in common. Building a training set from such data means that if one were to try to format it as tabular data for conventional data mining, large sections of the tables would or could be empty.

There is a tacit assumption made in the design of most data mining algorithms that the data presented will be complete. A further necessity is that the actual mining algorithms employed, whether supervised or unsupervised, must be able to handle sparse data. Namely, machine learning algorithms perform badly with incomplete data sets where only part of the information is supplied. For instance, methods based on neural networks or Ross Quinlan's ID3 algorithm are highly accurate with good and representative samples of the problem, but perform badly with biased data. Most of the time, a better model presentation with a more careful and unbiased representation of input and output is enough. A particularly relevant area where finding the appropriate structure and model is the key issue is text mining.

XPath is the standard mechanism used to refer to nodes and data items within XML. It has similarities to standard techniques for navigating directory hierarchies used in operating system user interfaces. To data mine and structure mine XML data of any form, at least two extensions to conventional data mining are required. These are the ability to associate an XPath statement with any data pattern and sub-statements with each data node in the data pattern, and the ability to mine the presence and count of any node or set of nodes within the document.

As an example, if one were to represent a family tree in XML, using these extensions one could create a data set containing all the individual nodes in the tree, data items such as name and age at death, and counts of related nodes, such as number of children. More sophisticated searches could extract data such as grandparents' lifespans.

The addition of these data types related to the structure of a document or message facilitates structure mining.

See also

Graph kernel
Structured content
Inductive programming

References

Andrew N Edmonds, On data mining tree structured data in XML, Data mining UK conference, University of Nottingham, Aug 2003
Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997. ISBN 0-521-58519-8
R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, 2001. ISBN 0-471-05669-3
F. Hadzic, H. Tan, T.S. Dillon, Mining of Data with Complex Structures, Springer, 2010. ISBN 978-3-642-17556-5

External links

The 5th International Workshop on Mining and Learning with Graphs, Firenze, Aug 1-3, 2007
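
As a rough illustration of the two XPath-related extensions discussed in this article (associating a path statement with each data node, and mining the count of related nodes), the following Python sketch walks a small, made-up family tree and emits one row per individual. The element name `person`, the attributes `name` and `age_at_death`, and the helper `mine_nodes` are assumptions made for this example; they do not come from any schema or tool mentioned in the article, and the locator strings are only XPath-style labels built by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical family tree; element and attribute names are illustrative only.
FAMILY_XML = """
<person name="Ada" age_at_death="81">
  <person name="Ben" age_at_death="67">
    <person name="Cy" age_at_death="70"/>
  </person>
  <person name="Dee" age_at_death="90"/>
</person>
"""


def mine_nodes(root):
    """Return one row per person: a path locator, data items, and a child count."""
    rows = []

    def visit(node, path):
        children = node.findall("person")  # direct child nodes only
        rows.append({
            "xpath": path,                                    # locator for this node
            "name": node.get("name"),                         # data item
            "age_at_death": int(node.get("age_at_death")),    # data item
            "children": len(children),                        # structural count
        })
        # Recurse, extending the locator with a 1-based positional step.
        for position, child in enumerate(children, start=1):
            visit(child, f"{path}/person[{position}]")

    visit(root, "/person")
    return rows


rows = mine_nodes(ET.fromstring(FAMILY_XML))
for row in rows:
    print(row)
```

Each row could then be passed to a conventional, table-oriented mining algorithm. The `children` column is the kind of structural information that a purely tabular export of the data items would lack.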