2005 ACM International Conference on Information and Knowledge Management (CIKM '05)
Oct. 31 - Nov. 05, 2005, Bremen, Germany

ViPER: Augmenting Automatic Information Extraction with Visual Perceptions

Kai Simon, Georg Lausen

Abstract:

In this paper we address the problem of unsupervised Web data extraction. We show that unsupervised Web data extraction becomes feasible for pages that are made up of repetitive patterns, as is the case, for example, for search engine result pages. The extraction rules are generated automatically, without any training or human interaction, by operating on the DOM tree or the flat tag token sequence of a single page. Our contribution to automatic data extraction is twofold. First, we identify and rank potential repetitive patterns with respect to the user's visual perception of the Web page, since the location and size of matching elements within a Web page are important criteria for relevance. Second, matching subsequences of the pattern with the highest weight are aligned using global multiple sequence alignment techniques. Experimental results show that our system achieves high accuracy in distilling and aligning the relevant results inside complex Web pages.
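
The two steps named in the abstract can be illustrated with a minimal sketch. It is not the ViPER implementation: the function names, the fixed pattern length, the use of rendered area as the only visual cue, and the pairwise Needleman-Wunsch alignment (standing in for global multiple sequence alignment) are all assumptions made for illustration.

```python
from collections import defaultdict

def repeated_patterns(tokens, length):
    """Map each tag n-gram that occurs at least twice to its start offsets."""
    starts = defaultdict(list)
    for i in range(len(tokens) - length + 1):
        starts[tuple(tokens[i:i + length])].append(i)
    return {p: s for p, s in starts.items() if len(s) >= 2}

def rank_patterns(tokens, areas, length):
    """Rank repeated patterns by the rendered area their occurrences cover.

    `areas[i]` is a hypothetical per-token rendered area (e.g. in px^2);
    overlapping occurrences are simply counted twice in this sketch.
    """
    patterns = repeated_patterns(tokens, length)

    def weight(pattern):
        return sum(sum(areas[s:s + length]) for s in patterns[pattern])

    return sorted(patterns, key=weight, reverse=True)

def align(a, b, gap="-"):
    """Global pairwise alignment (Needleman-Wunsch), match +1, else -1."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i
    for j in range(1, m + 1):
        score[0][j] = -j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1 if a[i - 1] == b[j - 1] else -1
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] - 1,
                              score[i][j - 1] - 1)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = 1 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else -1
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1]

# Toy page: two large result records followed by two small footer items.
tokens = ["div", "a", "br", "div", "a", "br", "p", "hr", "p", "hr"]
areas = [100, 50, 0, 100, 50, 0, 5, 1, 5, 1]
best = rank_patterns(tokens, areas, 2)[0]
row_a, row_b = align(["div", "a", "br"], ["div", "a", "img", "br"])
```

In the toy input the large `div`/`a` records outrank the small footer pattern, and the alignment inserts a gap opposite the extra `img` tag so that corresponding tags of the two records line up in the same column.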