PDF: Difference between revisions


Revision as of 10:38, 25 March 2024 edit Coolcaesar (talk \| contribs) Extended confirmed users 30,696 edits →‎PostScript language: Clarifying this ← Previous edit		Revision as of 20:05, 14 July 2024 edit undo ClueBot NG (talk \| contribs) Bots, Pending changes reviewers, Rollbackers 6,403,895 edits m Reverting possible vandalism by 182.239.164.240 to version by Annh07. Report False Positive? Thanks, ClueBot NG. (4335440) (Bot) Tag: Rollback Next edit →
(38 intermediate revisions by 23 users not shown)
Line 38: }} '''Portable Document Format''' ('''PDF'''), standardized as '''ISO 32000''', is a [[file format]] developed by [[Adobe Inc.\|Adobe]] in 1992 to present [[document]]s, including text formatting and images, in a manner independent of [[application software]], [[Computer hardware\|hardware]], and [[operating system]]s.<ref name="pdf-ref-1.7">{{cite web\|author=Adobe Systems Incorporated\|url=https://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf\|title=PDF Reference\|date=November 2006\|edition=6th\|version=1.7\|url-status=dead\|archiveurl=https://web.archive.org/web/20081001170454/https://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf\|archivedate=October 1, 2008\|accessdate=January 12, 2023}}</ref><ref>{{Cite web\|last=Warnock\|first=J.\|url=https://www.pdfa.org/norm-refs/warnock_camelot.pdf\|title=The Camelot Project\|date=14 October 2004<!--dates from PDF source-->\|orig-date=Original date 5 May 1995\|archive-url=https://web.archive.org/web/20110718230852/http://www.planetpdf.com/planetpdf/pdfs/warnock_camelot.pdf\|archive-date=July 18, 2011\|url-status=live}}</ref> Based on the [[PostScript]] language, each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, [[font]]s, [[vector graphics]], [[raster images]] and other information needed to display it. PDF has its roots in "The Camelot Project" initiated by Adobe co-founder [[John Warnock]] in 1991.<ref>{{Cite web\|title=What is a PDF? 
Portable Document Format {{!}} Adobe Acrobat DC\|url=https://www.adobe.com/acrobat/about-adobe-pdf.html\|access-date=January 12, 2023\|publisher=Adobe Systems Inc.\|language=en\|archive-date=January 30, 2023\|archive-url=https://web.archive.org/web/20230130032548/https://www.adobe.com/acrobat/about-adobe-pdf.html\|url-status=live}}</ref> PDF was standardized as ISO 32000 in 2008.<ref>{{cite web \|url = http://wwwimages.adobe.com/www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf \|title = ISO 32000-1:2008 \|archive-url=https://web.archive.org/web/20180726064724/http://wwwimages.adobe.com/www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf \| archive-date=July 26, 2018\|url-status=dead}}</ref> The last edition as ISO 32000-2:2020 was published in December 2020. Line 45: == History == {{Main\|History of PDF}} The development of PDF began in 1991 when [[John Warnock]] wrote a paper for a project then code-named Camelot, in which he proposed the creation of a simplified version of PostScript called Interchange PostScript (IPS).<ref name="Pfiffner_Page_137">{{cite book \|last1=Pfiffner \|first1=Pamela \|title=Inside the Publishing Revolution: The Adobe Story \|date=2003 \|publisher=Peachpit Press \|location=Berkeley \|isbn=0-321-11564-3 \|page=137}}</ref> Unlike traditional PostScript, which was tightly focused on rendering [[print job]]s to output devices, IPS would be optimized for displaying pages to any screen and any platform.<ref name="Pfiffner_Page_137" /> [[Adobe Systems]] made the PDF specification available free of charge in 1993. 
In the early years PDF was popular mainly in [[desktop publishing]] workflows, and competed with several other formats, including [[DjVu]], [[Envoy (WordPerfect)\|Envoy]], Common Ground Digital Paper, Farallon Replica and even Adobe's own PostScript format. Line 54 ⟶ 56: ISO published ISO 32000-2 in 2017, available for purchase, replacing the free specification provided by Adobe.<ref name=nowfree/> In December 2020, the second edition of PDF 2.0, ISO 32000-2:2020, was published, with clarifications, corrections, and critical updates to normative references<ref>{{Cite web \|url=https://www.pdfa.org/iso-32000-22020-is-now-available/ \|title=ISO 32000-2:2020 is now available \|publisher=PDFA \|date=December 14, 2020 \|access-date=February 3, 2021 \|archive-date=December 4, 2022 \|archive-url=https://web.archive.org/web/20221204112238/https://www.pdfa.org/iso-32000-22020-is-now-available/ \|url-status=live }}</ref> (ISO 32000-2 does not include any proprietary technologies as normative references).<ref name=":0">{{cite web\|url=https://www.iso.org/standard/75839.html\|title=ISO 32000-2 – Document management — Portable document format — Part 2: PDF 2.0\|date=January 5, 2021 \|publisher=ISO\|access-date=February 3, 2021\|archive-date=January 28, 2021\|archive-url=https://web.archive.org/web/20210128003836/https://www.iso.org/standard/75839.html\|url-status=live}}</ref> In April 2023 the PDF Association made ISO 32000-2 available for download free of charge.<ref name=nowfree>{{cite press release\| title=Announcing no-cost access to the latest PDF standard: ISO 32000-2 (PDF 2.0)\| publisher=PDF Association\| url=https://pdfa.org/sponsored-standards \| date=16 June 2023\| orig-date=Updated; originally published 5 April 2023\| access-date=October 6, 2023\| archive-date=September 23, 2023\| 
archive-url=https://web.archive.org/web/20230923202322/https://pdfa.org/sponsored-standards/\| url-status=live}}</ref> == Technical details == A PDF file is often a combination of [[vector graphics]], text, and [[bitmap graphics]]. The basic types of content in a PDF are: Line 74 ⟶ 76: === PostScript language === [[PostScript]] is a [[page description language]] run in an [[Interpreter (computing)\|interpreter]] to generate an image.<ref name="Pfiffner_Page_137" /> It can handle graphics and has standard features of [[programming language]]s such as [[conditional (computer programming)\|branching]] and [[loop (computing)\|looping]].<ref name="Pfiffner_Page_137" /> PDF is a subset of PostScript, simplified to remove such [[control flow]] features, while graphics commands remain.<ref name="Pfiffner_Page_137" /> PostScript was originally designed for a drastically different [[use case]]: transmission of one-way linear [[print job]]s in which the PostScript interpreter would collect a series of commands until it encountered the <code>showpage</code> command, then execute all the commands to render a page as a raster image to a printing device.<ref name="Pfiffner_Page_139">{{cite book \|last1=Pfiffner \|first1=Pamela \|title=Inside the Publishing Revolution: The Adobe Story \|date=2003 \|publisher=Peachpit Press \|location=Berkeley \|isbn=0-321-11564-3 \|page=139}}</ref> PostScript was not intended for long-term storage and real-time interactive rendering of [[electronic document]]s to [[computer monitor]]s, so there was no need to support anything other than consecutive rendering of pages.<ref name="Pfiffner_Page_139" /> If there was an error in the final printed output, the user would correct it at the application level and send a new print job in the form of an entirely new PostScript file.
Thus, any given page in a PostScript file could be accurately rendered only as the cumulative result of executing all preceding commands to draw all previous pages—any of which could affect subsequent pages—plus the commands to draw that particular page, and there was no easy way to bypass that process to skip around to different pages.<ref name="Pfiffner_Page_139" /> Traditionally, to go from PostScript to PDF, a source PostScript file (that is, an executable program) is used as the basis for generating PostScript-like PDF code (see, e.g., [[Adobe Distiller]]). This is done by applying standard [[compiler]] techniques like [[loop unrolling]], [[inline expansion\|inlining]] and removing unused branches, resulting in code that is purely declarative and static.<ref name="Pfiffner_Page_139" /> The end result is then packaged into a [[container format]], together with all necessary [[Dependency (computer science)\|dependencies]] for correct rendering (external files, graphics, or fonts to which the document refers), and [[Data compression\|compressed]]. Modern applications write to printer drivers which directly generate PDF rather than going through PostScript first. As a document format, PDF has several advantages over PostScript: * PDF contains only static [[Declarative programming\|declarative]] PostScript code that can be processed as data, and does not require a full program [[Interpreter (computing)\|interpreter]] or [[compiler]].<ref name="Pfiffner_Page_139" /> This avoids the complexity and security risks of an engine with such a higher complexity level. * Like [[Display PostScript]], PDF has supported [[transparency (graphic)\|transparent graphics]] since version 1.4, while standard PostScript does not. 
* PDF enforces the rule that the code for any particular page cannot affect any other pages.<ref name="Pfiffner_Page_139" /> That rule is strongly recommended for PostScript code too, but has to be implemented explicitly (see, e.g., the [[Document Structuring Conventions]]), as PostScript is a full programming language that allows greater flexibility and is not limited to the concepts of pages and documents. * All data required for rendering is included within the file itself, improving portability.<ref>{{cite web \|url=https://www.adobe.com/content/dam/acom/en/devnet/actionscript/articles/PLRM.pdf\|title=PostScript Language Reference\|archive-url=https://web.archive.org/web/20210724120635/https://www.adobe.com/content/dam/acom/en/devnet/actionscript/articles/PLRM.pdf\|archive-date=2021-07-24\|url-status=dead}}</ref> Its disadvantages are: * A loss of flexibility, and limitation to a single use case.{{citation needed\|date=December 2023}} * A (sometimes much) larger file size.<ref>{{cite web \|last1=Anton Ertl \|first1=Martin \|title=What is the PDF format good for? \|url=https://www.complang.tuwien.ac.at/anton/why-not-pdf.html \|website=complang.tuwien.ac.at \|publisher=Vienna University of Technology \|access-date=8 April 2024\|archive-url=https://web.archive.org/web/20240404031526/https://www.complang.tuwien.ac.at/anton/why-not-pdf.html\|archive-date=4 April 2024\|url-status=live}}</ref> For trivially repetitive content, this can be mitigated with compression.
(Overall, compared to other formats such as a bitmap image, it is still orders of magnitude smaller.){{cn\|date=December 2023}} PDF since v1.6 supports embedding of interactive 3D documents: 3D drawings can be embedded using [[U3D]] or [[PRC (file format)\|PRC]] and various other data formats.<ref name="3d#1">{{cite web \|url=https://www.adobe.com/manufacturing/resources/3dformats/ \|title=3D supported formats \|publisher=Adobe Systems Inc. \|date=July 14, 2009 \|access-date=February 21, 2010 Line 113 ⟶ 115: Objects may be either ''direct'' (embedded in another object) or ''indirect''. Indirect objects are numbered with an ''object number'' and a ''generation number'' and defined between the <code>obj</code> and <code>endobj</code> keywords if residing in the document root. Beginning with PDF version 1.5, indirect objects (except other streams) may also be located in special streams known as ''object streams'' (marked <code>/Type /ObjStm</code>). This technique enables non-stream objects to have standard stream filters applied to them, reduces the size of files that have large numbers of small indirect objects and is especially useful for ''Tagged PDF''. Object streams do not support specifying an object's ''generation number'' (other than 0). An index table, also called the cross-reference table, is located near the end of the file and gives the byte offset of each indirect object from the start of the file.<ref>Adobe Systems, PDF Reference, pp. 39–40.</ref> This design allows for efficient [[random access]] to the objects in the file, and also allows for small changes to be made without rewriting the entire file (''incremental update''). Before PDF version 1.5, the table would always be in a special ASCII format, be marked with the <code>xref</code> keyword, and follow the main body composed of indirect objects. 
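The classic ASCII cross-reference table described above can be illustrated with a minimal Python sketch. It assumes a single, well-formed section in the pre-1.5 layout: the <code>xref</code> keyword, subsection headers of the form "first-object-number count", and fixed-width entries consisting of a 10-digit byte offset, a 5-digit generation number, and an <code>n</code> (in use) or <code>f</code> (free) flag. The function name <code>parse_xref_section</code> and the sample bytes are hypothetical and chosen only for illustration; cross-reference streams and incremental updates are not handled.

```python
import re


def parse_xref_section(data: bytes) -> dict[int, tuple[int, int, bool]]:
    """Parse one classic ASCII cross-reference section.

    Returns {object_number: (byte_offset, generation, in_use)}.
    Assumes `data` begins at the `xref` keyword and is well formed.
    """
    lines = data.splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("expected 'xref' keyword")

    entries: dict[int, tuple[int, int, bool]] = {}
    i = 1
    while i < len(lines):
        # A subsection header is "<first object number> <entry count>".
        header = re.fullmatch(rb"(\d+) (\d+)\s*", lines[i])
        if header is None:
            break  # e.g. the 'trailer' keyword: end of the table
        first, count = int(header.group(1)), int(header.group(2))
        i += 1
        for obj_num in range(first, first + count):
            if i >= len(lines):
                raise ValueError("table ended before all entries were read")
            # Each entry: 10-digit offset, 5-digit generation, 'n' or 'f'.
            entry = re.fullmatch(rb"(\d{10}) (\d{5}) ([nf])\s*", lines[i])
            if entry is None:
                raise ValueError(f"malformed entry for object {obj_num}")
            entries[obj_num] = (
                int(entry.group(1)),
                int(entry.group(2)),
                entry.group(3) == b"n",
            )
            i += 1
    return entries


# Hypothetical three-entry table: object 0 is the head of the free list.
sample = (
    b"xref\n"
    b"0 3\n"
    b"0000000000 65535 f \n"
    b"0000000017 00000 n \n"
    b"0000000081 00000 n \n"
    b"trailer\n"
)
table = parse_xref_section(sample)
print(table[1])  # (17, 0, True)
```

With such a table in hand, a reader can seek directly to the byte offset of any in-use object instead of scanning the file, which is the random-access property the text describes.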
Version 1.5 introduced optional ''cross-reference streams'', which have the form of a standard stream object, possibly with filters applied. Such a stream may be used instead of the ASCII cross-reference table and contains the offsets and other information in binary format. The format is flexible in that it allows for integer width specification (using the <code>/W</code> array), so that, for example, a document not exceeding 64 [[KiB]] in size may dedicate only 2 bytes for object offsets. At the end of a PDF file is a footer containing Line 128 ⟶ 130: PDF files come in two layouts: non-linearized (not "optimized") and linearized ("optimized"). Non-linearized PDF files can be smaller than their linear counterparts, though they are slower to access because portions of the data required to assemble pages of the document are scattered throughout the PDF file. Linearized PDF files (also called "optimized" or "web optimized" PDF files) are constructed in a manner that enables them to be read in a Web browser plugin without waiting for the entire file to download, since all objects required for the first page to display are optimally organized at the start of the file.<ref name="pdf-ref">{{cite web\|url=https://www.adobe.com/devnet/pdf/pdf_reference.html\|title=Adobe Developer Connection: PDF Reference and Adobe Extensions to the PDF Specification\|publisher=Adobe Systems Inc.\|access-date=December 13, 2010\|url-status=dead\|archive-url=https://web.archive.org/web/20061115132507/https://www.adobe.com/devnet/pdf/pdf_reference.html\|archive-date=November 15, 2006}}</ref> PDF files may be optimized using [[Adobe Acrobat]] software or [[QPDF]]. Page dimensions are not limited by the format itself.
However, Adobe Acrobat imposes a limit of 15 million by 15 million inches, or 225 trillion in<sup>2</sup> (145,161 km<sup>2</sup>).<ref name="pdf-ref-1.7" />{{rp\|1129}} == Imaging model == Line 146 ⟶ 148: === Raster images === [[Raster images]] in PDF (called ''Image XObjects'') are represented by dictionaries with an associated stream. The dictionary describes the properties of the image, and the stream contains the image data. (Less commonly, small raster images may be embedded directly in a page description as an ''inline image''.) Images are typically ''filtered'' for compression purposes. Image filters supported in PDF include the following general-purpose filters: * ''ASCII85Decode'', a filter used to put the stream into 7-bit ASCII, Line 192 ⟶ 194: == Additional features == === Logical structure and accessibility<span class="anchor" id="Tagged PDF"></span> === {{See also\|PDF/A-1\|PDF/UA}} A "'''tagged" PDF'''<!--boldface per [[WP:R#PLA]]--> (see clause 14.8 in ISO 32000) includes document structure and semantics information to enable reliable text extraction and [[accessibility]].<ref>{{cite web \|title=Tagged PDF Best Practice Guide: Syntax \|url=https://pdfa.org/wp-content/uploads/2019/06/TaggedPDFBestPracticeGuideSyntax.pdf \|website=pdfa.org \|publisher=[[PDF Association]] \|date=June 2019 \|access-date=2024-06-24}}</ref> Technically speaking, tagged PDF is a stylized use of the format that builds on the logical structure framework introduced in PDF 1.3.
Tagged PDF defines a set of standard structure types and attributes that allow page content (text, graphics, and images) to be extracted and reused for other purposes.<ref>{{cite web\|first=Duff\|last=Johnson\|date=April 22, 2004\|title=What is Tagged PDF?\|url=https://www.talkingpdf.org/what-is-tagged-pdf/\|url-status=live\|archive-url=https://web.archive.org/web/20040807132851/http://www.planetpdf.com/enterprise/article.asp?ContentID=6067\|archive-date=August 7, 2004}}</ref> Tagged PDF is not required in situations where a PDF file is intended only for print. Since the feature is optional, and since the rules for tagged PDF were relatively vague in ISO 32000-1, support for tagged PDF among consuming devices, including [[assistive technology]] (AT), is uneven as of 2021.<ref>{{Cite web\|title=Is PDF accessible?\|website=DO-IT - Disabilities, Opportunities, Internetworking, and Technology\|publisher=University of Washington\|date=October 4, 2022\|url=https://www.washington.edu/doit/pdf-accessible?1002=\|access-date=January 12, 2023\|archive-date=February 10, 2023\|archive-url=https://web.archive.org/web/20230210114239/https://www.washington.edu/doit/pdf-accessible?1002=\|url-status=live}}</ref> ISO 32000-2, however, includes an improved discussion of tagged PDF, which is anticipated to facilitate further adoption. An ISO-standardized subset of PDF specifically targeted at accessibility, [[PDF/UA]], was first published in 2012. Line 290 ⟶ 293: ===Malware vulnerability=== PDF files can be infected with viruses, Trojans, and other malware. They can contain hidden JavaScript code that might exploit vulnerabilities in a PDF, or hidden objects that are executed when the file that hides them is opened; less commonly, a malicious PDF can launch malware.<ref>{{cite web \| title=Can PDFs have viruses? 
Keep your files safe \| publisher=Adobe \| url=https://www.adobe.com/acrobat/resources/can-pdfs-contain-viruses.html \| access-date=3 October 2023 \| archive-date=October 4, 2023 \| archive-url=https://web.archive.org/web/20231004120143/https://www.adobe.com/acrobat/resources/can-pdfs-contain-viruses.html \| url-status=live }}</ref> PDF attachments carrying viruses were first discovered in 2001. The virus, named ''OUTLOOK.PDFWorm'' or ''Peachy'', uses [[Microsoft Outlook]] to send itself as an attached Adobe PDF file. It was activated with Adobe Acrobat, but not with Acrobat Reader.<ref>Adobe Forums, [https://forums.adobe.com/thread/302989 Announcement: PDF Attachment Virus "Peachy"] {{Webarchive\|url=https://web.archive.org/web/20150904151955/https://forums.adobe.com/thread/302989 \|date=September 4, 2015 }}, August 15, 2001.</ref> Line 305 ⟶ 308: Many PDF viewers are provided free of charge from a variety of sources. Programs to manipulate and edit PDF files are available, usually for purchase. There are many software options for creating PDFs, including the PDF printing capabilities built into [[macOS]], [[iOS]],<ref>{{Cite web\|url=https://ijunkie.com/how-to-create-pdf-web-page-safari-iphone-ipad-ios-11/\|title=How to Create a PDF from Web Page on iPhone and iPad in iOS 11\|last=Pathak\|first=Khamosh\|date=October 7, 2017\|website=iJunkie\|access-date=January 12, 2023\|archive-date=January 12, 2023\|archive-url=https://web.archive.org/web/20230112153246/https://ijunkie.com/how-to-create-pdf-web-page-safari-iphone-ipad-ios-11/\|url-status=live}}</ref> and most [[Linux]] distributions. 
Much document processing software, including [[LibreOffice]], [[Microsoft Office 2007]] (if updated to [[Office 2007#Service Pack 2\|SP2]]) and later,<ref>{{cite web\|url=http://support.microsoft.com/kb/953195\|title=Description of 2007 Microsoft Office Suite Service Pack 2 (SP2)\|publisher=[[Microsoft]]\|url-status=dead\|archive-url=https://web.archive.org/web/20090429212434/http://support.microsoft.com/kb/953195\|archive-date=April 29, 2009\|access-date=January 12, 2023}}</ref> [[WordPerfect]] 9, and [[Scribus]], can export documents in PDF format. There are many PDF print drivers for Microsoft Windows, the [[pdfTeX]] typesetting system, the [[DocBook]] PDF tools, applications developed around [[Ghostscript]] and [[Adobe Acrobat]] itself as well as [[Adobe InDesign]], [[Adobe FrameMaker]], Adobe Illustrator, Adobe Photoshop, that allow a "PDF printer" to be set up, which when selected sends output to a PDF file instead of a physical printer. [[Google]]'s online office suite [[Google Docs]] allows uploading and saving to PDF. Some web apps offer free PDF editing and annotation tools. The [[Free Software Foundation]] was "developing a free, high-quality and fully functional set of libraries and programs that implement the PDF file format and associated technologies to the ISO 32000 standard", as one of its [[High priority free software projects\|high priority projects]].<ref>On 2014-04-02, a note dated February 10, 2009 referred to [http://www.fsf.org/campaigns/priority.html Current FSF High Priority Free Software Projects] {{Webarchive\|url=https://web.archive.org/web/20070810230457/http://www.fsf.org/campaigns/priority.html \|date=August 10, 2007 }} as a source. 
Content of the latter page, however, changes over time.</ref><ref>{{cite web\|url=http://gnupdf.org/Goals_and_Motivations\|title=Goals and Motivations\|publisher=GNUpdf\|date=November 28, 2007\|website=gnupdf.org\|access-date=April 2, 2014\|archive-date=July 4, 2014\|archive-url=https://web.archive.org/web/20140704114405/http://www.gnupdf.org/Goals_and_Motivations\|url-status=live}}</ref> In 2011, however, the GNU PDF project was removed from the list of "high priority projects" due to the maturation of the [[Poppler (software)\|Poppler library]],<ref>{{cite web\|title=GNU PDF project leaves FSF High Priority Projects list; mission complete!\|url=http://www.fsf.org/blogs/community/gnu-pdf-project-leaves-high-priority-projects-list-mission-complete\|date=October 6, 2011\|first=Matt\|last=Lee\|publisher=Free Software Foundation\|website=fsf.org\|archive-date=December 28, 2014\|archive-url=https://web.archive.org/web/20141228050435/http://www.fsf.org/blogs/community/gnu-pdf-project-leaves-high-priority-projects-list-mission-complete\|url-status=live}}</ref> which has enjoyed wider use in applications such as [[Evince]] with the [[GNOME]] desktop environment. 
Poppler is based on [[Xpdf]]<ref>{{cite web\|url=http://poppler.freedesktop.org/\|title=Poppler Homepage\|quote=Poppler is a PDF rendering library based on the xpdf-3.0 code base.\|access-date=January 12, 2023\|archive-date=January 8, 2015\|archive-url=https://web.archive.org/web/20150108235708/http://poppler.freedesktop.org/\|url-status=live}}</ref><ref>{{cite web\|url=http://cgit.freedesktop.org/poppler/poppler/tree/README-XPDF\|title=Xpdf License\|quote=Xpdf is licensed under the GNU General Public License (GPL), version 2 or 3.\|access-date=January 12, 2023\|archive-date=April 14, 2013\|archive-url=https://archive.today/20130414194348/http://cgit.freedesktop.org/poppler/poppler/tree/README-XPDF\|url-status=live}}</ref> code base. There are also commercial development libraries available as listed in [[List of PDF software]]. The [[Apache PDFBox]] project of the [[Apache Software Foundation]] is an open source Java library, licensed under the [[Apache License]], for working with PDF documents.<ref>{{cite web\|url=http://pdfbox.apache.org/\|url-status=live\|title=The Apache PDFBox project- Apache PDFBox 3.0.0 released\|date=August 17, 2023\|archive-date=January 7, 2023\|archive-url=https://web.archive.org/web/20230107234923/https://pdfbox.apache.org/}} Updated for new releases.</ref> Line 315 ⟶ 318: [[Raster image processor]]s (RIPs) are used to convert PDF files into a [[raster graphics\|raster format]] suitable for imaging onto paper and other media in printers, digital production presses and [[prepress]] in a process known as [[rasterization]]. 
RIPs capable of processing PDF directly include the Adobe PDF Print Engine<ref>{{cite web\|url=https://www.adobe.com/products/pdfprintengine/overview.html\|title=Adobe PDF Print Engine\|publisher=Adobe Systems Inc.\|access-date=August 20, 2014\|archive-date=August 22, 2013\|archive-url=https://web.archive.org/web/20130822034446/http://www.adobe.com/products/pdfprintengine/overview.html\|url-status=live}}</ref> from Adobe Systems and Jaws<ref>{{cite web\|url=http://www.globalgraphics.com/products/jaws_rip/\|title=Jaws® 3.0 PDF and PostScript RIP SDK\|work=globalgraphics.com\|access-date=November 26, 2010\|archive-date=March 5, 2016\|archive-url=https://web.archive.org/web/20160305090728/http://globalgraphics.com/products/jaws_rip\|url-status=dead}}</ref> and the [[Harlequin RIP]] from [[Global Graphics]]. In 1993, the Jaws raster image processor from Global Graphics became the first shipping prepress RIP that interpreted PDF natively without conversion to another format. The company released an upgrade to its Harlequin RIP with the same capability in 1997.<ref>{{cite web \|url = http://www.globalgraphics.com/products/harlequin-multi-rip \|title=Harlequin MultiRIP\|access-date=March 2, 2014\|url-status=dead\|archive-url=https://web.archive.org/web/20140209215413/http://www.globalgraphics.com/products/harlequin-multi-rip/\|archive-date=February 9, 2014 }}</ref> [[Agfa-Gevaert]] introduced and shipped Apogee, the first prepress workflow system based on PDF, in 1997. Line 337 ⟶ 340: There are also [[web annotation]] systems that support annotation in PDF and other document formats. In cases where PDFs are expected to have all of the functionality of paper documents, ink annotation is required.
=== Conversion and Information Extraction === PDF's emphasis on preserving the visual appearance of documents across different software and hardware platforms poses challenges to the conversion of PDF documents to other [[file format]]s and the targeted [[Information extraction \| extraction of information]], such as text, images, tables, [[Bibliography \| bibliographic information]], and document [[metadata]]. Numerous tools and source code libraries support these tasks. Several labeled [[dataset]]s to test PDF conversion and information extraction tools exist and have been used for benchmark evaluations of the tools' performance.<ref>{{Citation \|last=Meuschke \|first=Norman \|title=A Benchmark of PDF Information Extraction Tools Using a Multi-task and Multi-domain Evaluation Framework for Academic Documents \|date=2023 \|work=Information for a Better World: Normality, Virtuality, Physicality, Inclusivity \|volume=13972 \|pages=383–405 \|editor-last=Sserwanga \|editor-first=Isaac \|url=https://link.springer.com/10.1007/978-3-031-28032-0_31 \|place=Cham \|publisher=Springer Nature Switzerland \|language=en \|doi=10.1007/978-3-031-28032-0_31 \|isbn=978-3-031-28031-3 \|last2=Jagdale \|first2=Apurva \|last3=Spinde \|first3=Timo \|last4=Mitrović \|first4=Jelena \|last5=Gipp \|first5=Bela \|editor2-last=Goulding \|editor2-first=Anne \|editor3-last=Moulaison-Sandy \|editor3-first=Heather \|editor4-last=Du \|editor4-first=Jia Tina\|arxiv=2303.09957 }}</ref> == Alternatives == Line 347 ⟶ 350: [[MODCA\|Mixed Object: Document Content Architecture]] is a competing format. MO:DCA-P is a part of [[Advanced Function Presentation]]. == See also == * [[Web page]] Line 358 ⟶ 362: == Further reading == * {{cite book \| last1 = Hardy \| first1 = M. R. B. \| last2 = Brailsford \| first2 = D. F. 
\| chapter = Mapping and displaying structural transformations between XML and PDF \| title = Proceedings of the 2002 ACM symposium on Document engineering – DocEng '02 \| pages = 95–102 \| year = 2002 \| url = https://www.cs.nott.ac.uk/~psadb1/Publications/Download/2002/Hardy02.pdf \| doi = 10.1145/585058.585077 \| publisher = Proceedings of the 2002 ACM symposium on Document engineering \|isbn = 1-58113-594-7 \| s2cid = 9371237 }}{{relevance inline\|date=May 2022\|reason=Why would random conference paper about some particular plugin for Adobe Acrobat be of interest to the reader?}} * PDF 2.0 {{cite web \|url = https://www.iso.org/standard/75839.html \|title=ISO 32000-2:2020(en), Document management — Portable document format — Part 2: PDF 2.0 \|website = International Organization for Standardization \|language = English \|access-date = December 16, 2020 }} * PDF 2.0 {{cite web \|url = https://www.iso.org/standard/63534.html \|title=ISO 32000-2:2017(en), Document management — Portable document format — Part 2: PDF 2.0 \|website = International Organization for Standardization \|date=August 3, 2017 \|language = English \|access-date = January 31, 2019 }}