Combining DOM tree and geometric layout analysis for online medical journal article segmentation

Published: 11 June 2006
DOI: 10.1145/1141753.1141777

Abstract

We describe an HTML web page segmentation algorithm and apply it to online medical journal articles, both regular HTML files and HTML files converted from PDF. The web page content is modeled as a zone tree based primarily on the geometric layout of the page. For a given journal article, the zone tree is generated by combining DOM tree analysis with the recursive X-Y cut algorithm. Using additional visual cues such as background color, font size, and font color, the page is then segmented into homogeneous regions. The evaluation used 104 articles from 11 journals: of 9726 ground-truth zones, 9376 were correctly segmented, an accuracy of 96.40%. Segmenting the entire web page into zones can significantly speed up the subsequent information retrieval steps and increase their accuracy.
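
The abstract names the recursive X-Y cut as the geometric half of the method but does not give its details. The following is a minimal sketch of the general technique only, not the authors' implementation. It assumes each rendered element has already been reduced to a bounding box `(left, top, right, bottom)`, for instance taken from DOM nodes after layout. The names `Zone`, `xy_cut`, `split_at_gaps`, and the `min_gap` threshold are hypothetical, and the DOM analysis and visual cues (background color, font size, font color) are left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box format: (left, top, right, bottom) in page coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Zone:
    """One node of a zone tree: a bounding box plus its sub-zones."""
    box: Box
    children: List["Zone"] = field(default_factory=list)


def bounding_box(boxes: List[Box]) -> Box:
    """Smallest box that encloses every box in a non-empty list."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def split_at_gaps(boxes: List[Box], lo: int, hi: int, min_gap: float) -> List[List[Box]]:
    """Group boxes along one axis, cutting where the projection is empty for >= min_gap."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # farthest extent covered by the current group
    for b in ordered[1:]:
        if b[lo] - reach >= min_gap:
            groups.append([b])      # wide enough gap: start a new group
        else:
            groups[-1].append(b)    # overlaps or nearly touches: same group
        reach = max(reach, b[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> Zone:
    """Recursively split a non-empty list of boxes into a zone tree."""
    zone = Zone(bounding_box(boxes))
    # Try cuts along the y axis first (indices 1, 3), then along the x axis (0, 2).
    for lo, hi in ((1, 3), (0, 2)):
        groups = split_at_gaps(boxes, lo, hi, min_gap)
        if len(groups) > 1:
            zone.children = [xy_cut(g, min_gap) for g in groups]
            return zone
    return zone  # no cut possible in either direction: leaf zone


def leaves(zone: Zone) -> List[Box]:
    """Collect the boxes of the leaf zones, in tree order."""
    if not zone.children:
        return [zone.box]
    return [box for child in zone.children for box in leaves(child)]


# Hypothetical page: a full-width header above a two-column body.
page = [
    (0, 0, 600, 50),      # header
    (0, 80, 280, 300),    # left column, first paragraph
    (0, 310, 280, 500),   # left column, second paragraph
    (320, 80, 600, 500),  # right column
]
zones = leaves(xy_cut(page, min_gap=20))
```

In this example the first cut separates the header from the body, the second separates the two columns, and the two left-column paragraphs stay in one zone because the 10-unit gap between them is below `min_gap`. A real system would have to choose that threshold per page or per journal, which is one place where the visual cues mentioned in the abstract could help.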

Published In

JCDL '06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
June 2006
402 pages
ISBN:1595933549
DOI:10.1145/1141753

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. HTML document segmentation
  2. document layout analysis
  3. document object model (DOM)
  4. web information retrieval

Qualifiers

  • Article

Conference

JCDL06: Joint Conference on Digital Libraries 2006
June 11 - 15, 2006
Chapel Hill, NC, USA

Acceptance Rates

Overall Acceptance Rate 415 of 1,482 submissions, 28%
