-
trafilatura Public
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
-
htmldate Public
Fast and robust date extraction from web pages, with Python or on the command-line
-
courlan Public
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
-
py3langid Public
Forked from saffsd/langid.py
Faster, modernized fork of the language identification tool langid.py
-
simplemma Public
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
-
German-NLP Public
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
-
awesome-digital-humanities Public
Forked from dh-tech/awesome-digital-humanities
Software for humanities scholars using quantitative or computational methods.
HTML · Creative Commons Zero v1.0 Universal · Updated Oct 30, 2024
-
awesome-web-scraping Public
Forked from lorien/awesome-web-scraping
List of libraries, tools and APIs for web scraping and data processing.
Makefile · Other · Updated Oct 29, 2024
-
coronakorpus Public archive
Material for building a German-language COVID-19 web corpus
-
btw21 Public
Forked from jfilter/btw21
Visualization of the most frequent words in the German federal election in 2021
Jupyter Notebook · MIT License · Updated Sep 24, 2021
-
awesome-crawler Public
Forked from BruceDone/awesome-crawler
A collection of awesome web crawlers and spiders in different languages
-
jparser Public
Forked from fxsjy/jparser
A readability parser that can extract the title, content, and images from HTML pages
Python · MIT License · Updated Feb 7, 2020
-
jlcl-style Public archive
Experiments to modernize the LaTeX class of the JLCL
-
geokelone Public
Integrates spatial and textual data processing tools into a modular software package featuring preprocessing, geocoding, disambiguation, and visualization
-
toponyms Public
Old prototype for toponym extraction in historical texts written in German
-
vardial-experiments Public
Experiments conducted on the occasion of the VarDial shared tasks
-
valency-oriented-chunker Public
A one-pass FSA valency-oriented chunker for German (proof of concept)
Perl · GNU Lesser General Public License v3.0 · Updated Oct 14, 2016
-
-
flux-toolchain Public
Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain
-
equipe-crawler Public archive
Automatically exported from code.google.com/p/equipe-crawler
Perl · Updated Jul 3, 2015
-
gps-corpus-builder Public archive
Automatically exported from code.google.com/p/gps-corpus-builder
Perl · Updated Jul 3, 2015
-
zeitcrawler Public archive
Automatically exported from code.google.com/p/zeitcrawler
-
laclos Public
LAnguage-CLassified OpenSubtitles
-
microblog-explorer Public
Perform crawls of social networks (identi.ca, reddit, friendfeed) to gather internal and external links and identify their language
-
corpus-visualizer Public archive
Explore, visualize and publish corpora as CSS/XHTML documents
CSS · Updated Oct 14, 2012
-
url-compressor Public
Fast pattern-based URL compression for lists of links