CMDR: Classifying Nodes for Mining Data Records with Different HTML Structures

Title:
CMDR: Classifying Nodes for Mining Data Records with Different HTML Structures
Other Titles:
IEEE TENCON 2017
DOI:
Publication URL:
Publication Date:
05 November 2017
Citation:
Abstract:
This paper addresses the problem of automated extraction of structured data records from web pages. In particular, we focus on the extraction of posts from online forum sites. We show that variability in the HTML structure of user-generated content in forum posts can negatively affect extraction accuracy, and we propose integrating a deep learning node classifier into the popular Mining Data Regions (MDR) process proposed in prior work. Experiments on a dataset of forum web pages containing posts with varying HTML structures indicate the merits of the proposed modification to MDR.
License type:
PublisherCopyrights
Funding Info:
Singapore National Research Foundation
Description:
ISBN:

Files uploaded:

File Size Format
cmdr-tencon-camerareadyversion-0929.pdf 368.20 KB PDF