Alternative parsers


This page is a compilation of links, descriptions, and status reports on the various alternative MediaWiki parsers, that is, programs and projects other than MediaWiki itself that can, or are intended to, translate MediaWiki's text markup syntax into something else. Some have quite narrow purposes, while others are possible contenders for replacing the somewhat labyrinthine code that currently drives MediaWiki itself.

Many of the things linked here are likely to be out of date, under-maintained, or even abandoned. However, in the interest of not duplicating the same work over and over, it seemed sensible to collect what is "out there". In addition, although many alternative parsers exist, almost no unofficial parser powers any wiki site; one exception is wikitextparser, which powers the OpenTTD wiki through TrueWiki.

Parsers that build an abstract syntax tree (AST) and provide access to it are listed under #Parsers providing an AST; parsers that do not build an AST but extract information are listed under #Parsers extracting information; the rest are under #Other parsers.
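
To make the AST category concrete, here is a minimal sketch in Python of what "building a tree and giving access to it" can look like. It is not taken from any parser listed on this page: the node classes (`Text`, `Template`, `Link`) and the `parse` function are hypothetical names chosen for illustration, and only `{{...}}` and `[[...]]` nesting is recognised. Everything else, including `|` separators, is kept as plain text so that the tree can be serialised back to the original markup.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class Text:
    value: str

    def to_wikitext(self) -> str:
        return self.value


@dataclass
class Template:
    # Raw content between "{{" and "}}", including any "|" separators.
    children: List["Node"] = field(default_factory=list)

    def to_wikitext(self) -> str:
        return "{{" + "".join(c.to_wikitext() for c in self.children) + "}}"


@dataclass
class Link:
    # Raw content between "[[" and "]]".
    children: List["Node"] = field(default_factory=list)

    def to_wikitext(self) -> str:
        return "[[" + "".join(c.to_wikitext() for c in self.children) + "]]"


Node = Union[Text, Template, Link]

# Opening delimiter -> (node class, matching closing delimiter).
_OPEN = {"{{": (Template, "}}"), "[[": (Link, "]]")}


def _flush(buf: List[str], nodes: List[Node]) -> None:
    """Turn buffered characters into a Text node."""
    if buf:
        nodes.append(Text("".join(buf)))
        buf.clear()


def _parse_until(source: str, pos: int, closer: Optional[str]) -> Tuple[List[Node], int, bool]:
    """Parse from pos until closer is found; returns (nodes, new_pos, closed)."""
    nodes: List[Node] = []
    buf: List[str] = []
    while pos < len(source):
        two = source[pos:pos + 2]
        if closer is not None and two == closer:
            _flush(buf, nodes)
            return nodes, pos + 2, True
        if two in _OPEN:
            cls, close = _OPEN[two]
            children, new_pos, closed = _parse_until(source, pos + 2, close)
            if closed:
                _flush(buf, nodes)
                nodes.append(cls(children))
                pos = new_pos
                continue
            # Unbalanced opener: keep it as literal text and carry on.
            buf.append(two)
            pos += 2
            continue
        buf.append(source[pos])
        pos += 1
    _flush(buf, nodes)
    return nodes, pos, False


def parse(source: str) -> List[Node]:
    """Parse wikitext into a flat list of top-level nodes."""
    nodes, _, _ = _parse_until(source, 0, None)
    return nodes


tree = parse("Hello {{cite|url=[[Foo|bar]]}} and [[Baz]] {{unclosed")
round_trip = "".join(node.to_wikitext() for node in tree)
```

Because every character of the input ends up in exactly one node, `round_trip` equals the input string. A parser that only extracts information would typically discard that structure and return just the values it was looking for.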

Parsers providing an AST

Free software

Name and link	Main author(s)	Language	Input	Output	Full implementation	Can convert output back to markup	Comments / other information	License
Parsoid	Gabriel Wicke and the Parsoid / Visual editor team	PEG / PHP (formerly Node.js)	markup, XML dumps, test cases	tokens, HTML5 DOM with RDFa and round-trip data	yes	yes	Fully-featured round-tripping parser/runtime that powers the Visual editor on Wikipedia. Work is ongoing to provide an HTML-only read/edit interface, and later to make it the default parser for MediaWiki. See roadmap.	GPLv2+
DizzyLogic Wiki Parser	Dizzy Logic	C++	XML dumps	Syntax tree in XML, plain text	no	no	Fast datamining-oriented parser for English Wikipedia. Capable of processing all of English Wikipedia into plain text and XML in 2-3 hours on a modern processor. Convenient graphical interface. Windows installer available (64-bit).	MIT license
mwparserfromhell	The Earwig	Python	markup	AST	almost	yes	A Python library to convert Wiki markup to a navigable string, which can be used to examine and manipulate templates. Written in pure Python, compatible with Python 2.7 and 3, with no dependencies.	MIT License
wtf_wikipedia	Spencer Kelly	JavaScript	markup	JSON	almost	no	Supports recursive links and templates, parses infoboxes and links, resolves special templates, parses images and categories. Runs server-side and in the browser.	MIT
wikiapi	kanashimi	JavaScript	markup	JavaScript native object	almost	yes	Parses sections, templates with parameters, links, images and categories, wiki-table to JS array or JS array to wiki-table, and many more. You may modify parts of the wikitext, then regenerate the page just using `parsed.toString()`. Runs on Node.js and in the browser.	MIT
Sweble Wikitext Parser	Hannes Dohrn	Java	markup	AST, XML, HTML	almost	?	Claims to be very thorough. There are three papers surrounding the Sweble Wikitext Parser.	Apache License 2.0
wikitextparser	5j9	Python	markup	AST	almost	yes	Provides several accessor methods in an object tree to navigate to structural elements like sections, tables, links etc. Supports extracting table data as list of lists. Available via pip, supports Python 3.	GPLv3
mwlib	PediaPress.com	Python with C library	markup and other	parse tree, HTML, PDF, XML, OpenDocument	no	?	Used by MediaWiki's "Print/export" feature, see Reading/Web/PDF Functionality.	BSD
wb2pdf	Dirk Hünniger	Haskell	online article	LaTeX, PDF, Parse Tree, HTML, OpenDocument, EPUB	no	?	Recursive descent based on monadic parser combinators. Allows for non-context-free input, especially badly formed HTML as often found on Wikipedia.	GPL
XWiki Rendering Framework	XWiki dev team	Java	various WikiMarkups	Well formed sequence of events, HTML/XHTML, other WikiMarkups	no	no	XWiki can be used as a full-fledged wiki supporting several WikiMarkups (including MediaWiki's markup). It also offers a standalone Rendering Engine that can be used as a Java library for parsing/rendering WikiMarkups. As of 2016/03 it cannot output MediaWiki markup, though.	LGPL
mediawiki-parser	Peter Potrowl, Erik Rose	Python	markup	XHTML, raw text, AST	no	no	GSoC-2011 project; the use of a PEG parser makes it easy to improve. Parser functions are not supported yet.	GPLv3
smc.mw	Marcus Brinkmann	Python	markup	AST, HTML	no	no	Stateful PEG parser based on Grako (Archived 2014-03-09 at the Wayback Machine), with a very clean separation of parsing stages, grammars and semantic transformations.	BSD
Pandoc	John MacFarlane	Haskell	markup	many & AST	no	not identical	Can convert a subset of MediaWiki markup to ~35 different formats (5 of which are flavors of Markdown).	GPLv2
MwParserFromScratch	CXuesong	C#	markup	AST	no	yes	A portable .NET library that parses wikitext into an abstract syntax tree. For now it supports most of the common markup expressions except file links, double-underscored magic words, and tables.	Apache License
gensim.segment_wiki	RaRe Technologies	Python	MediaWiki XML	JSON	no	no	Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python; segment_wiki is its script for Wikipedia parsing and extraction.	LGPLv2.1
mediawiki-parser	Ben Gamari	Haskell	markup or MediaWiki XML	AST	almost	no	mediawiki-parser served as the basis of the extraction pipeline of the NIST TREC Complex Answer Retrieval information retrieval track. It is a PEG parser capable of producing an abstract syntax tree representing most of MediaWiki syntax.	BSD-3-Clause
parse_wiki_text	Fredrik Portström	Rust	markup	AST	no	no	Parse Wiki Text attempts to take all uncertainty out of parsing wiki text by converting it to another format that is easy to work with. The target format is Rust objects that can ergonomically be processed using iterators and match expressions.	modified MIT
wikitextprocessor	Tatu Ylonen	Python	XML dumps	AST	?	?	Can expand templates and Lua macros.	MIT unless otherwise noted in individual files (see LICENSE)
wikiparser-node	Bhsd	TypeScript	markup	AST, HTML	almost	yes	Parsing, modifying, and linting wikitext. Runs in Node.js and in the browser (online playground).	GPLv3

Proprietary

Name and link	Main author(s)	Language	Input	Output	Full implementation	Can convert output back to markup	Comments / other information	License
WikiTaxi	Ralf Junker	Delphi / Pascal	MediaWiki markup, page or fragment	Node-tree, HTML, potentially others	almost	Hand-crafted parser with template expansion, parser functions (core and extended), tag extensions (<ref>, <source>), wiki text parsing. Used for the WikiTaxi offline reader.	No sources available

Abandoned

Name and link	Main author(s)	Language	Input	Output	Full implementation	Can convert output back to markup	Comments / other information
DKPro JWPL parser	Torsten Zesch, Richard Eckart de Castilho, Oliver Ferschke, Elisabeth Niemann	Java	XML dump	API to access pages, outlinks, inlinks and more	no	"JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that allows to access all information contained in Wikipedia." "JWPL is for you: If you need structured access to Wikipedia in Java." The older parser is no longer maintained; JWPL now uses Sweble.	LGPL
FlexBisonParse	Timwi	flex, bison and C	markup fragment	Custom XML	no	Intended as an eventual replacement for the parsing code inside MediaWiki itself.
sanskrit-coders/wiki-tools	Vishvas Vasuki	Scala	MediaWiki text	MediaWiki text and section tree	no	Only parses MediaWiki sections, nothing else. One can parse a wiki page with multiple sections, get a section tree, and add, access and delete sections.	Creative Commons
Perl Wikipedia Toolkit	Michal Jurosz	Perl	XML dump, SQL dump	Own parse tree, WikiMedia markup	no	Developed for computer-assisted Wikipedia translation. (Limited functionality.)
WikiOnCD (Archived 2006-01-15 at the Wayback Machine)	Andrew Rodland	Perl	SQL dump or markup	HTML, Parse tree (eventually?)	no	Started out as an offline wiki browser, but grew a parser when Wiki2static turned out to be too limiting. No web presence yet; code is in the SVN.	GPL
WikiPress Publisher [dead link]	Erwin Jurschitza	Delphi 7	XML dump	DocBook XML, Digibib XML, HTML	no	Used for the German DVD, generates lists of bad markup.	No sources available
Saya.Parser.Wiki [dead link]	Nana Sakisaka	C++	markup	AST	no	Pure C++11 parser implemented with Boost.Spirit.Qi.	Boost Software License 1.0
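
Several entries above mention a "section tree" as output. As a rough illustration of that idea, the following Python sketch groups the lines of a page under their `== heading ==` markers. It is an assumption-laden toy, not the algorithm of any listed project: `Section` and `build_section_tree` are hypothetical names, and headings inside templates, comments or `<nowiki>` blocks are not treated specially.

```python
import re
from dataclasses import dataclass, field
from typing import List

# A heading line: the same run of 1-6 "=" signs on both sides of the title.
_HEADING = re.compile(r"^(={1,6})\s*(.+?)\s*\1\s*$")


@dataclass
class Section:
    title: str
    level: int
    body: List[str] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)


def build_section_tree(source: str) -> Section:
    """Return a root Section (level 0) whose children are the top-level sections."""
    root = Section(title="", level=0)
    stack = [root]  # path from the root to the section currently being filled
    for line in source.splitlines():
        match = _HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Close every open section that is at the same depth or deeper.
            while stack[-1].level >= level:
                stack.pop()
            section = Section(title=match.group(2), level=level)
            stack[-1].children.append(section)
            stack.append(section)
        else:
            stack[-1].body.append(line)
    return root


page = "Lead text.\n== History ==\nOld.\n=== Origins ===\nOlder.\n== Usage ==\nNow.\n"
root = build_section_tree(page)
titles = [child.title for child in root.children]
```

Here `titles` holds the two level-2 headings, and "Origins" is nested under "History". Text before the first heading lands in `root.body`.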

Parsers extracting information

Name and link	Main author(s)	Language	Input	Output	Full implementation	Can convert output back to markup	Comments / other information	License
PHP-Wikipedia-Syntax-Parser	Don Wilson	PHP	markup	Associative array	no		Parses top-level sections, w:Wikipedia:Persondata, infoboxes, external links, categories, and interlanguage links.	GPL
Wiki-infobox-parser	Zhipeng Jiang	JavaScript	markup	JSON	no		A light Wikipedia infobox parser written in JavaScript.	MIT
wiktextract	Tatu Ylonen	Python	XML dumps	JSON	?		Parses most of the English Wiktionary into JSON. Can expand templates and Lua macros. You can run it locally, or directly grab the JSON output hosted at [1].	MIT
ParseWiki	Gerges	PHP	wikitext	Associative array		Yes	A library that helps parse wikitext data	GPL-3.0
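
The extraction approach can be sketched without building a full tree. The Python example below pulls the named parameters out of the first `{{Infobox ...}}` call on a page by counting brace depth. It is only a sketch under simplifying assumptions, not the method used by any tool in this table: the function names are hypothetical, HTML comments and `<nowiki>` are ignored, and positional (unnamed) parameters are skipped.

```python
from typing import Dict, List, Optional


def find_template(source: str, name: str) -> Optional[str]:
    """Return the text between the outer braces of the first {{name ...}} call."""
    start = source.lower().find("{{" + name.lower())
    if start == -1:
        return None
    depth, pos = 0, start
    while pos < len(source) - 1:
        two = source[pos:pos + 2]
        if two == "{{":
            depth += 1
            pos += 2
        elif two == "}}":
            depth -= 1
            pos += 2
            if depth == 0:
                return source[start + 2:pos - 2]
        else:
            pos += 1
    return None  # the template was never closed


def split_top_level(body: str) -> List[str]:
    """Split on "|" characters that are not inside nested {{...}} or [[...]]."""
    parts: List[str] = []
    current: List[str] = []
    depth, pos = 0, 0
    while pos < len(body):
        two = body[pos:pos + 2]
        if two in ("{{", "[["):
            depth += 1
            current.append(two)
            pos += 2
        elif two in ("}}", "]]"):
            depth = max(depth - 1, 0)
            current.append(two)
            pos += 2
        elif body[pos] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            pos += 1
        else:
            current.append(body[pos])
            pos += 1
    parts.append("".join(current))
    return parts


def extract_infobox(source: str) -> Dict[str, str]:
    """Map parameter names to raw (unexpanded) wikitext values."""
    body = find_template(source, "Infobox")
    if body is None:
        return {}
    fields: Dict[str, str] = {}
    for part in split_top_level(body)[1:]:  # the first part is the template name
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


sample = """{{Infobox person
| name = Ada Lovelace
| born = {{birth date|1815|12|10}}
| field = [[Mathematics|maths]]
}}
Some article text."""
info = extract_infobox(sample)
```

The values in `info` are still raw wikitext (for example, `born` is the unexpanded `{{birth date|...}}` call). Expanding templates is a separate and much harder problem, which is why few of the tools listed here attempt it.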

Other parsers

Name and link	Main author(s)	Language	Input	Output	Full implementation	Comments / other information	License
Mylyn WikiText	David Green	Java	Local files	HTML, DocBook, Eclipse Help, DITA, extensible	no	Integration with Ant and Eclipse runtime.	EPL
wikipedia-js	kenshiro_o	Node.js	markup	HTML	no	A simple client that enables you to query Wikipedia articles in English. The results are formatted in basic HTML. You can retrieve either a summary of an article (i.e. before the table of contents) or a full article.	MIT
WikiExtractor	Giuseppe Attardi, Antonio Fuschetto	Python	XML dumps	text	no	Simple and fast tool for extracting plain text from Wikipedia dumps. It performs template expansion and handles parser functions (core and extended).	GPL
Mediawiki2HTML Machine	Johannes Buchner	PHP	markup	HTML	no	Project for parsing without the MediaWiki engine.	AGPL3 + any later version
Java API (Bliki engine)	Axel Kramer	Java	markup fragment	HTML, PDF	almost	Java Wikipedia API (supports ParserFunctions, Lua/Scribunto...).	EPLv1.0 or GPLv2.1+
WikiCloth	nricciar	Ruby	markup	HTML	no	Ruby implementation of the MediaWiki markup language, including a fair amount of the parser functions.	MIT
YaCy	YaCy dev team	Java	XML dump	XML with Dublin Core Metadata	no	YaCy is a search engine and a MediaWiki parser is included as one of the import modules. MediaWiki XML dumps are first converted to Dublin Core XML as an intermediate format and then inserted into the search index using the built-in Dublin Core importer.	GPL
WiktionaryParser	dev team	Python	markup	JSON	no	Wiktionary parser. As of October 2019, downloads the article on the fly and parses "etymologies, definitions, pronunciations, examples, audio links and related words".	MIT
LuaWiki	Alexander Misel	Lua, PEG	markup	HTML	no	LuaWiki has a parser which supports the most common syntax used in the article namespace, but it defines a different grammar for templates.	GPLv3
wiktionary-dumps	excarnateSojourner	Python	XML dump	various	no	A collection of scripts for extracting various information specifically from database dumps of the English Wiktionary. Only one or two may be more broadly useful. Active as of 2023.	CC0
wikiparser-java	javalc6	Java	XML dump	various	no	The library has been developed to parse and render English Wiktionary. In addition to English, several other languages are supported.	Apache License
other wiktionary parsers	various	various	markup	various	no	See list at <stackoverflow.com/q/3364279>	various

Abandoned

Name and link	Main author(s)	Language	Input	Output	Full implementation	Comments / other information	License
libmwparser	Saitmoh	C	XML dumps, markup	XML, XHTML, Expanded WikiText	almost	Primarily an offline reader for Wikimedia wikis, with interwiki support. Libmwparser is a source-independent library which supports most of MediaWiki syntax and some extensions like math or gallery.	GPL
Wiky.php	Toni Lähdekorpi	PHP, Regular Expressions	markup	HTML	no	A tiny PHP library that uses only regular expressions to convert Wiki markup to HTML.	Apache License/GPL/LGPL/MPL/CC
Wiky	Tanin Na Nakorn	Ruby	markup	HTML	no	A simple Ruby library to convert Wiki markup to HTML.	Apache License
Wiky.js	Tanin Na Nakorn	JavaScript	markup	HTML	no	A simple JavaScript library to convert Wiki markup to HTML (limited subset).	Apache License
txtwiki.js	Joao Sa	JavaScript	markup	Text	no	A JavaScript library to convert MediaWiki markup to plaintext.	MIT License
mw2html	Connelly Barnes	Python	Wiki url	HTML	no	Minimal setup; gets the basic job of creating a static copy of the wiki done.	Public Domain
PHP5 WP	Dan Goldsmith	PHP	markup	HTML	no	Parser with a plugin framework for adding additional syntax. Configurable for alternative markup, e.g. PmWiki.	MPL 2.0
JAMWiki	Ryan	Java	JAMWiki front-end	HTML	no	Java wiki engine that supports MediaWiki syntax. The roadmap also calls for XML import and export that will be compatible with MediaWiki.	LGPLv2
InstaView	Pilaf	JavaScript	markup fragment	HTML	no	Provides instant preview while editing a page (without reloading).	BSD
InstaView	C. Scott Ananian	JavaScript	markup fragment	HTML	no	Port of Pilaf's code to Node.js, volo, and the browser.	BSD
Tero-dump	Tero Karvinen	?	Local wiki installation, including MySQL, PHP, web server	HTML	no	Scripts for grabbing the whole wiki; does not include images.
Text_Wiki_Mediawiki	Multiple	PHP	markup	HTML, LaTeX, Plain text	no	Part of the Text_Wiki library.	LGPL
TomeRaider export	Erik Zachte	Perl	XML dump	TomeRaider database	no	See en:Wikipedia:TomeRaider database for more details.
Waikiki	Magnus Manske	C++	SQL dump (via SQLite)	HTML	no	Abandoned in favour of "flexbisonparse", but has been used inside some experimental "front ends".
Wikiwyg (Archived 2008-12-16 at the Wayback Machine)	Jim Higson	JavaScript	A live installation of MediaWiki	HTML (via XML)	no	More than just a parser; attempts to create a fully functional client-side interface.
wik2dict	Guaka	Python	SQL dump	DICT	no
wiki2pdf	Stephan Walter	Python (and PHP)	markup fragment or set of online articles	LaTeX, PDF	no	Project is incomplete and dormant.
WikiPDF	Felipe Sanches	Python (and PHP)	One selected article	LaTeX based on templates, PDF	no	MediaWiki extension that uses Stephan Walter's wiki2pdf as backend.
Wikifilter	?	C++ (VS)	XML dumps	HTML	no	A Windows program that uses Apache/IIS to serve the pages. Abandoned in 2006, before ParserFunctions were available.
Wikipedia Dump Reader	Benjamin Thyreau	Python	XML dumps	On screen	no	Cross-platform viewer.	GPLv2/~BSD license
Marker	Ryan Blue	Ruby	markup (subset)	HTML or formatted text	no	Marker is a Ruby implementation of a subset of the MediaWiki markup language, intended to bring MediaWiki's markup language to non-wiki applications with multiple output formats.	GPL
Kiwi	Thomas Luce, Karl Matthias, AboutUs.org	C, Ruby, PEG	markup	HTML	almost	Kiwi is a PEG-based C implementation with Ruby bindings and a command line parser. It is very fast and supports most of the MediaWiki syntax.	BSD
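
Many of the small converters above translate a limited subset of markup to HTML with regular expressions only. The Python sketch below shows the general shape of that technique. It is a hypothetical illustration written for this page, not a port of any listed library: it handles only headings, bold, italic, internal links and single-level `*` lists, and the `/wiki/` link prefix is an assumption.

```python
import html
import re

_HEADING = re.compile(r"(={1,6})\s*(.+?)\s*\1")
_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def _link(match: "re.Match") -> str:
    """Render [[target]] or [[target|label]] as an anchor."""
    target = match.group(1).strip()
    label = (match.group(2) or target).strip()
    return '<a href="/wiki/{}">{}</a>'.format(target.replace(" ", "_"), label)


def _inline(text: str) -> str:
    # Bold must be handled before italic, since ''' contains ''.
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    return _LINK.sub(_link, text)


def wikitext_to_html(source: str) -> str:
    out = []
    in_list = False
    for raw in source.splitlines():
        # Escape &, < and > but keep apostrophes, which carry bold/italic markup.
        line = html.escape(raw.strip(), quote=False)
        is_item = line.startswith("*")
        if in_list and not is_item:
            out.append("</ul>")
            in_list = False
        if not line:
            continue
        heading = _HEADING.fullmatch(line)
        if heading:
            level = len(heading.group(1))
            out.append("<h{0}>{1}</h{0}>".format(level, _inline(heading.group(2))))
        elif is_item:
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append("<li>{}</li>".format(_inline(line[1:].strip())))
        else:
            out.append("<p>{}</p>".format(_inline(line)))
    if in_list:
        out.append("</ul>")
    return "\n".join(out)


rendered = wikitext_to_html(
    "== History ==\n'''Bold''' and ''italic'' with [[Main Page|home]].\n* one\n* [[Two]]\n"
)
```

This line-by-line approach breaks down quickly on real pages (templates, tables, nested lists, markup spanning several lines), which is one reason most regex-only converters in this table are marked as incomplete.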

A non-parser dumper

One of the common uses of alternative parsers is to dump wiki content into static form, such as HTML or PDF. Tim Starling has written a script which isn't a parser, but uses the MediaWiki internal code to dump an entire wiki to HTML, from the command-line. See Extension:DumpHTML. This has been used (years ago) to create the static dumps at https://dumps.wikimedia.org.

There are also similar dumpers as part of the Kiwix project, for example mwoffliner, and you can query the RESTBase API to obtain HTML-format output with semantic information (such as transclusions) included.
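
As a sketch of the API route, the Python snippet below builds the URL for a page's HTML and fetches it with the standard library. It assumes the Wikimedia-style REST path `/api/rest_v1/page/html/{title}`; other wikis may expose a different path or none at all, and the `User-Agent` string is a placeholder that you should replace with your own contact details. The function names are hypothetical.

```python
import urllib.parse
import urllib.request


def page_html_url(title: str, site: str = "https://en.wikipedia.org") -> str:
    """Build the REST URL for a page title (assumed path layout, see above)."""
    # Spaces become underscores; everything else, including "/", is percent-encoded.
    slug = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return "{}/api/rest_v1/page/html/{}".format(site, slug)


def fetch_page_html(title: str, site: str = "https://en.wikipedia.org") -> str:
    """Download the rendered HTML of a page. Performs a network request."""
    request = urllib.request.Request(
        page_html_url(title, site),
        # Placeholder value: identify your own tool and contact address here.
        headers={"User-Agent": "alt-parser-example/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


url = page_html_url("Main Page")
```

Calling `fetch_page_html("Main Page")` would return the HTML as a string. Only the URL is built here so that the example runs without network access.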

Related topics

If you want to convert MediaWiki documents into some other format, the above tools are useful. If you want to convert HTML documents or other formats into MediaWiki documents, you may find Wikipedia:Tools/Editing tools#Wikisyntax conversion utilities and Manual:Importing external content more useful.
One-pass parser
MediaWiki lexer and MediaWiki flexer (not parsers as such, just grammar definitions; probably superseded by/within other projects below)
en:Wikipedia:Text editor support lists various scripts and extensions, such as syntax highlighting for Emacs, Vim, and other editors; some of these may include rudimentary parsing capabilities.
Here are some proof-of-concept rules for a subset of the MediaWiki markup: these are written in a metalanguage that treats preformatted text as source text, and everything else as comment.
Markup spec aims to produce a specification of MediaWiki's markup format.
Help:Extension:ParserFunctions is the main parser extension for MediaWiki.
mwparserfromhell and Parsoid's similar jsapi are useful tools for extraction and transformation tasks.
If no library suits your needs, you still have the option of parsing the data dumps: see meta:Data_dumps and meta:Data_dumps/Other_tools.