Document processing: Difference between revisions

Revision as of 18:05, 29 September 2021 by Serols (undid revision 1047228067 by Vidhya raja priya); latest revision as of 16:37, 28 August 2024 by GrobbaZA81 (section: Automatic document processing).
(24 intermediate revisions by 20 users not shown)
{{Short description|Digitalisation of analog documents}}
'''Document processing''' is a field of research and a set of [[production process]]es aimed at making an analog [[document]] digital. Document processing does not simply aim to photograph or [[Image scanning|scan]] a document to obtain a [[digital image]], but also to make it digitally intelligible. This includes extracting the structure of the document or the [[Document layout analysis|layout]] and then the content, which can take the form of text or images. The process can involve traditional [[computer vision]] algorithms, [[convolutional neural networks]] or manual labor. The problems addressed are related to [[semantic segmentation]], [[object detection]], [[optical character recognition|optical character recognition (OCR)]], [[Handwritten text recognition|handwritten text recognition (HTR)]] and, more broadly, [[Transcription (linguistics)|transcription]], whether [[Automation|automatic]] or not.<ref>{{Cite book |url=https://books.google.com/books?id=gYOpFlMXcs0C&q=%22document+processing%22+ocr&pg=PA368 |title=Integrative Document & Content Management: Strategies for Exploiting Enterprise Knowledge |author1=Len Asprey |author2=Michael Middleton |date=2003 |publisher=Idea Group Inc (IGI) |isbn=9781591400554}}</ref> The term can also include the phase of digitizing the document using a scanner and the phase of interpreting the document, for example using [[natural language processing]] (NLP) or [[image classification]] technologies. It is applied in many industrial and scientific fields for the optimization of administrative processes, mail processing and the digitization of analog [[Archiving|archives]] and historical documents.

==Background==
Document processing was initially, and still is to some extent, a kind of production-line work dealing with the treatment of [[document]]s, such as letters and parcels, with the aim of sorting them or extracting data from them in bulk. This work could be performed in-house or through [[business process outsourcing]].<ref>{{Cite book |url=https://books.google.com/books?id=g4dxNB05dgoC&q=document+processing+bpo&pg=PA47 |title=Business Process Outsourcing: A Supply Chain of Expertises}}</ref><ref>{{Cite book |title=Outsourcing to India: The Offshore Advantage |author=Mark Kobayashi-Hillary |date=2005-12-05 |publisher=Springer Science & Business Media |isbn=9783540247944}}</ref> Document processing can indeed involve some kind of externalized manual labor, such as [[Amazon Mechanical Turk|mechanical Turk]]. As an example of manual document processing, as recently as 2007,<ref name="VisaDox">{{cite news |newspaper=[[The New York Times]] |url=https://www.nytimes.com/2007/12/02/us/02immig.html |title=Immigration Contractor Trims Wages |author=Julia Preston |date=December 2, 2007}}</ref> document processing for "millions of visa and citizenship applications" relied on "approximately 1,000 contract workers" working to "manage mail room and [[data entry clerk|data entry]]."

While document processing involved data entry via keyboard well before the use of a [[computer mouse]] or a [[Image scanner|computer scanner]], a 1990 article in ''[[The New York Times]]'' regarding what it called the "[[paperless office]]" stated that "document processing begins with the scanner."<ref name="Paper.NYT">{{cite news|newspaper=[[The New York Times]] |url=https://www.nytimes.com/1990/07/07/business/paper-once-written-off-keeps-a-place-in-the-office.html |title=Paper, Once Written Off, Keeps a Place in the Office |author=Lawrence M. Fisher |date=July 7, 1990}}</ref> In this context, a former [[Xerox]] vice-president, Paul Strassman, expressed a critical opinion, saying that computers add to rather than reduce the volume of paper in an office.<ref name="Paper.NYT"/> It was said that the engineering and maintenance documents for an airplane weigh "more than the airplane itself".{{citation needed|date=April 2019}}

==Automatic document processing==
A technology called automatic document processing, or sometimes intelligent document processing (IDP), emerged as a specific form of [[Process Automation|Intelligent Process Automation]] (IPA), combining [[artificial intelligence]] techniques such as [[Machine Learning]] (ML), [[Natural Language Processing]] (NLP) or [[Intelligent Character Recognition]] (ICR) to extract data from several types of documents.<ref>{{Cite web|url=http://www.di.uniba.it/~ndm/pubs/esposito05icdar.pdf|title=Intelligent Document processing|date=2005-04-07|website=Department of Computer Science – University of Bari|access-date=2018-09-08}}</ref><ref>{{Cite book |url=https://www.computer.org/csdl/proceedings-article/icdar/2005/24201100/12OmNqIQS59 |title="Intelligent Document Processing" in Proceedings. Eighth International Conference on Document Analysis and Recognition, Seoul, South Korea, 2005 pp. 1100-1104. doi: 10.1109/ICDAR.2005.144 |author=[[Floriana Esposito]], Stefano Ferilli, Teresa M. A. Basile, Nicola Di Mauro |date=2005-04-01 |doi=10.1109/ICDAR.2005.144 |s2cid=17302169}}</ref> Advancements in automatic document processing improve the ability to process [[unstructured data]] with fewer exceptions and at greater speed.<ref>{{Cite web |title=Intelligent Document Processing (IDP) |url=https://www.keymarkinc.com/intelligent-document-processing-idp/ |access-date=2024-07-12 |website=keymarkinc.com |language=en-US}}</ref>

=== Applications ===
Automatic document processing applies to a whole range of documents, whether structured or not. For instance, in the world of business and finance, technologies may be used to process paper-based invoices, forms, purchase orders, contracts, and currency bills.<ref>{{cite patent |country=US|number=US7873576B2|status=active|title= Financial document processing system |pubdate=2011-01-18|gdate=2011-01-18|invent1=John E. Jones|invent2=William J. Jones|invent3=Frank M. Csultis|url=https://patents.google.com/patent/US7873576B2/en}}</ref> Financial institutions use intelligent document processing to process high volumes of forms such as regulatory forms or loan documents.
IDP uses AI to extract and classify data from documents, replacing manual data entry.<ref>{{Cite web|last=Bridgwater|first=Adrian|title=Appian Adds Google Cloud Intelligence To Low-Code Automation Mix|url=https://www.forbes.com/sites/adrianbridgwater/2020/03/09/appian-adds-google-cloud-intelligence-to-low-code-automation-mix/|access-date=2021-04-21|website=Forbes|language=en}}</ref>

In medicine, document processing methods have been developed to facilitate patient follow-up and streamline administrative procedures, in particular by digitizing medical or laboratory analysis reports. The goal is also to standardize medical databases.<ref>{{cite journal |last1=Adamo|first1=Francesco|last2=Attivissimo|first2=Filippo|first3=Attilio|last3=Di Nisio|first4=Maurizio|last4=Spadavecchia|date=February 2015|title=An automatic document processing system for medical data extraction|url=https://www.sciencedirect.com/science/article/pii/S0263224114005016|journal=Measurement|volume=61|pages=88–99 |doi=10.1016/j.measurement.2014.10.032|bibcode=2015Meas...61...88A |access-date=31 January 2021}}</ref> Algorithms are also directly used to assist physicians in medical diagnosis, e.g.
by analyzing [[Magnetic resonance imaging|magnetic resonance images]],<ref>{{cite journal |last1=Changwan|first1=Kim|last2=Seong-Il|first2=Lee|last3=Won Joon|first3=Cho|date=September 2020|title=Volumetric assessment of extrusion in medial meniscus posterior root tears through semi-automatic segmentation on 3-tesla magnetic resonance images|url=https://www.sciencedirect.com/science/article/abs/pii/S1877051720301994|journal=Orthopaedics & Traumatology: Surgery & Research|volume=101|issue=5|pages=963–968|doi=10.1016/j.rcot.2020.06.003|s2cid=225215597 |access-date=31 January 2021}}</ref><ref>{{cite journal |last1=Despotović|first1=Ivana|last2=Bart|first2=Goossens|last3=Wilfried|first3=Philips|date=1 March 2015|title=MRI Segmentation of the Human Brain: Challenges, Methods, and Applications|journal=Computational Intelligence Techniques in Medicine|volume=2015|pages=963–968|doi=10.1155/2015/450341|pmid=25945121|pmc=4402572|doi-access=free}}</ref> or [[Microscope|microscopic]] images.<ref>{{cite journal |last1=Putzua|first1=Lorenzo|last2=Caocci|first2=Giovanni|last3=Di Rubertoa|first3=Cecilia|title=Leucocyte classification for leukaemia detection using image processing techniques|journal=Artificial Intelligence in Medicine|date=November 2014|url=https://www.sciencedirect.com/science/article/pii/S0933365714001031|volume=63|issue=3|pages=179–191|doi=10.1016/j.artmed.2014.09.002|pmid=25241903|hdl=11584/94592|hdl-access=free}}</ref>

Document processing is also widely used in the [[humanities]] and [[digital humanities]], in order to extract historical [[big data]] from archives or heritage collections.
Specific approaches were developed for various sources, including textual documents, such as newspaper archives,<ref>{{cite conference |url=https://www.zora.uzh.ch/id/eprint/191270/|title=Language Resources for Historical Newspapers: the Impresso Collection|last1=Ehrmann|first1=Maud|last2=Romanello|first2=Matteo|last3=Clematide|first3=Simon|last4=Ströbel|first4=Phillip|last5=Barman|first5=Raphaël|date=2020|book-title=Proceedings of the 12th Language Resources and Evaluation Conference|pages=958–968|location=Marseille, France}}</ref> but also images,<ref name="cini_archive_digitization">{{cite conference |url=https://www.ingentaconnect.com/content/ist/ac/2018/00002018/00000001/art00001|title=New Techniques for the Digitization of Art Historical Photographic Archives - the Case of the Cini Foundation in Venice|last1=Seguin|first1=Benoit|last2=Costiner|first2=Lisandra|last3=di Lenardo|first3=Isabella|last4=Kaplan|first4=Frédéric|date=April 1, 2018 |book-title=Archiving 2018 Final Program and Proceedings|publisher=Society for Imaging Science and Technology|pages=1–5|doi=10.2352/issn.2168-3204.2018.1.0.2}}</ref> or maps.<ref>{{cite conference |url=https://infoscience.epfl.ch/record/268282|title=A deep learning approach to Cadastral Computing|last1=Ares Oliveira|first1=Sofia|last3=Tourenc|first3=Bastien|last2=di Lenardo|first2=Isabella|last4=Kaplan|first4=Frédéric|date=11 July 2019|conference=Digital Humanities Conference|location=Utrecht, Netherlands}}</ref><ref>{{cite thesis|type=MSc|last=Petitpierre|first=Rémi|date=July 2020|title=Neural networks for semantic segmentation of historical city maps: Cross-cultural performance and the impact of figurative diversity|doi=10.13140/RG.2.2.10973.64484|arxiv=2101.12478}}</ref>

===Technologies===
Although traditional computer vision algorithms were widely used from the 1980s onward to solve
document processing problems,<ref>{{cite journal |last1=Fujisawa|first1=H.|last2=Nakano|first2=Y.|last3=Kurino|first3=K.|date= July 1992 |title=Segmentation methods for character recognition: from segmentation to document structure analysis |url= https://ieeexplore.ieee.org/document/156471|journal= Proceedings of the IEEE}}</ref><ref>{{cite journal |last1=Tang|first1=Yuan Y.|last2=Lee|first2=Seong-Whan|last3=Suen|first3=Ching Y.|title=Automatic document processing: a survey |url=https://www.sciencedirect.com/science/article/abs/pii/S0031320396000441|journal=Pattern Recognition|year=1996|volume=29|issue=12|pages=1931–1952|doi= 10.1016/S0031-3203(96)00044-1 |bibcode=1996PatRe..29.1931T |access-date=3 February 2021}}</ref> these were gradually replaced by neural network technologies during the 2010s.<ref>{{cite conference |url=https://ieeexplore.ieee.org/document/8563218|title= dhSegment: A Generic Deep-Learning Approach for Document Segmentation|last1=Ares Oliveira|first1=Sofia|last2=Seguin|first2=Benoit|last3=Kaplan|first3=Frederic|date=5–8 August 2018 |publisher=IEEE|location=Niagara Falls, NY, USA |conference=2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR)|doi=10.1109/ICFHR-2018.2018.00011}}</ref>

The digitization of 3D documents can rely in particular on techniques derived from [[photogrammetry]]. Sometimes, specific 2D scanners must also be developed to adapt to the size of the documents or for reasons of scanning ergonomics.<ref name="cini_archive_digitization"/> Document processing also depends on the digital encoding of the documents in a suitable [[file format]]. Furthermore, the processing of heterogeneous databases can rely on [[image classification]] technologies.
At the other end of the chain are various image completion, extrapolation or data cleanup algorithms. For textual documents, the interpretation can use [[natural language processing]] (NLP) technologies.

== See also ==

{{DEFAULTSORT:Document Processing}}
[[Category:Automatic identification and data capture]]
[[Category:Applied data mining]]
[[Category:Applications of computer vision]]