Data reverse engineering is a rapidly growing field of research that aims to make possible the evolution towards the Web 2.0. In this scenario, Web content should be self-descriptive so that it can be automatically interpreted and, possibly, used for purposes different from its original goal. The majority of documents on the Web are written in HTML and constitute a huge amount of legacy data. These documents are formatted for visual purposes only, and in different styles, owing to the diverse authorship and goals of the people writing them. This makes the retrieval and integration of Web content difficult to automate. We propose a structured approach to the data reverse engineering of data-intensive HTML Web sites. We focus on data content and on the way in which such content is structured on the Web. Our approach exploits a Web site data model to describe abstract structural features of HTML pages. Such a model can be profitably used to segment HTML documents into special blocks (Web entity blocks) that group semantically related objects. We developed a framework of methods and tools that supports the identification of the structure, function, and meaning of data organized in Web entity blocks. Using this framework, we demonstrate the feasibility and effectiveness of our approach over a set of real-life Web sites.

 

Authors

  1. Roberto De Virgilio

  2. Riccardo Torlone


Organization

Roma Tre - DIA

Via Vasca Navale 79

00146 - Rome - Italy