Evaluating different methods for automatically collecting large general corpora for Basque from the web

I Leturia - Proceedings of COLING 2012, 2012 - aclanthology.org
Abstract
In the last few years, much work has been done to build Basque corpora. But we still lack a large general corpus of a size comparable with those existing for major languages, and the gap is wider still if we take into account the corpora recently built automatically from the web, which now reach billions of words for English, German, Spanish, etc. As Basque is an under-resourced language, it is logical that we should also turn to this cheap and fast method of collecting corpora.
In this paper we present the research we have done to build a large general corpus of Basque from the web. We have tried the two methods mentioned in the literature, crawling and using search engines, and evaluated which best suits Basque in terms of parameters such as speed, cost, size and quality. Our conclusion is that crawling has the greater potential for building the largest corpora for Basque. Using this method we have built a good-quality corpus of more than 100 million words, and we expect to build a much larger one in the near future.