Archive for April, 2008|Monthly archive page

CLUVI

During this semester we have been discovering new pages that are quite interesting for our project, and CLUVI is one of them. But let's start at the beginning.

What is CLUVI?

CLUVI (Linguistic Corpus of the University of Vigo) is an open set of parallel textual corpora of specialized registers of the contemporary Galician language, developed by the SLI (Computational Linguistics Group of the University of Vigo) and publicly available on its website since September 2003. The CLUVI Corpus contains over 22 million words, and its main components are the TECTRA Corpus of English-Galician literary texts, the FEGA Corpus of French-Galician literary texts, the LEGA Corpus of Galician-Spanish legal texts, the UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation texts, the LOGALIZA Corpus of English-Galician software localization, and the CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information.

What is CLUVI used for, and which tools does it offer?

This web application permits both simple and very complex searches for isolated words or sequences of words, and shows the multilingual equivalents of the terms in context, as found in real and referenced translations. The search terms can belong to either language of the translation, but it is also possible to carry out true multilingual searches, that is, to search simultaneously for one term in each of the languages of the translation. The number of aligned works and language pairs available on the website increases regularly, since CLUVI is a very active academic research project still in progress. At the moment, the CLUVI Parallel Corpus webpage allows users to search five major corpora (TECTRA, FEGA, LEGA, UNESCO and LOGALIZA), as well as other minor parallel corpora now in progress. The CLUVI interface also allows users to browse the TURIGAL Corpus of Portuguese-English tourism texts and the Legebiduna Corpus of Basque-Spanish administrative texts developed by the DELi group at the University of Deusto.
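To make the idea of a multilingual search more concrete, here is a minimal sketch in Python. It is not CLUVI's actual implementation: the sample sentences, the `ALIGNED_UNITS` list and the `search` function are all our own invented illustrations. It only assumes that a parallel corpus can be seen as a list of aligned translation units, and that a multilingual search keeps the units matching a term in each language at the same time.

```python
# Hypothetical toy data: each entry is one aligned translation unit
# (English sentence, Galician sentence). These sentences are invented
# for illustration and are not taken from CLUVI.
ALIGNED_UNITS = [
    ("The house is white.", "A casa é branca."),
    ("The sea is calm today.", "O mar está calmo hoxe."),
    ("She left the house early.", "Ela saíu da casa cedo."),
]


def search(units, en_term=None, gl_term=None):
    """Return the units that contain every term that was given.

    Matching is a simple case-insensitive substring test, which is
    far cruder than what a real corpus search engine would do.
    """
    hits = []
    for en, gl in units:
        if en_term is not None and en_term.lower() not in en.lower():
            continue
        if gl_term is not None and gl_term.lower() not in gl.lower():
            continue
        hits.append((en, gl))
    return hits


# Search in one language only: every unit whose English side has "house".
monolingual = search(ALIGNED_UNITS, en_term="house")

# Multilingual search: "house" in English AND "branca" in Galician.
bilingual = search(ALIGNED_UNITS, en_term="house", gl_term="branca")

for en, gl in bilingual:
    print(en, "|", gl)
```

The search in one language shows every translation of "house" in context, while the multilingual search narrows the results to the units where both terms appear together.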

Besides this, CLUVI offers a large number of online documents and research projects about the CLUVI Corpus at the SLI (Computational Linguistics Group).

CLUVI’s main sections:

* LEGA Corpus of Galician-Spanish legal texts (6,329,655 words)
* UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation (3,724,620 words)
* LOGALIZA Corpus of English-Galician software localization (2,375,157 words)
* TECTRA Corpus of English-Galician literary texts
* FEGA Corpus of French-Galician literary texts
* CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information
* LEGE-BI Corpus of Basque-Spanish legal texts
* TURIGAL Corpus of Portuguese-English tourism texts

Apart from these, CLUVI has more sections in progress.

It is a great site to start with when looking into corpora.

Bibliography:

CLUVI

(Last visited April 29th.)

What is a corpus?

The word “corpus”, derived from the Latin word meaning “body”, may be used to refer to any text in written or spoken form. However, in modern linguistics the term refers to large collections of texts that represent a sample of a particular variety or use of language(s) and that are presented in machine-readable form.

Computer-readable corpora can consist of raw text only, i.e. plain text with no additional information. Many corpora, however, have been provided with some kind of linguistic information, here called mark-up.
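The difference between raw text and marked-up text can be shown with a small Python sketch. Everything in it is an assumption made for illustration: the sentence, the part-of-speech tags and the helper `words_with_tag` are invented, and real corpora use their own (usually much richer) mark-up schemes.

```python
# Raw text: plain text with no additional information.
raw_text = "The corpus contains texts"

# The same sentence with mark-up: each word is paired with a
# part-of-speech tag. The tags are hand-written and purely illustrative.
marked_up = [
    ("The", "DET"),
    ("corpus", "NOUN"),
    ("contains", "VERB"),
    ("texts", "NOUN"),
]


def words_with_tag(tokens, tag):
    """Return the words that carry the given tag, in their original order."""
    return [word for word, token_tag in tokens if token_tag == tag]


# With mark-up we can ask linguistic questions that raw text cannot answer
# directly, for example "which words are nouns?".
nouns = words_with_tag(marked_up, "NOUN")
print(nouns)
```

With raw text alone we could only search for word forms; the mark-up is what makes a search by category possible.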

Types of corpora

There are many different kinds of corpora. They can contain written or spoken (transcribed) language, modern or old texts, and texts from one language or several languages. The texts can be whole books, newspapers, journals, speeches, etc., or consist of extracts of varying length. The kinds of texts included and the way they are combined vary between corpora and corpus types.

‘General corpora’ consist of general texts, that is, texts that do not belong to a single text type, subject field, or register. An example of a general corpus is the British National Corpus. Some corpora contain texts that are sampled (chosen) from a particular variety of a language, for example, from a particular dialect or from a particular subject area. These corpora are sometimes called ‘Sublanguage Corpora’.

Corpora can consist of texts in one language (or language variety) only or of texts in more than one language. If the texts are the same in all languages, i.e. translations, the corpus is called a Parallel Corpus. A Comparable Corpus is a collection of “similar” texts.

Source: What is a Corpus?

Blog Post II – Outline for the report.

Our topic is Multilingual Corpus Resources. We will take the information from this link: Joseba Abaitua – wiki. This is, more or less, what we are going to do with the topic. We still have to choose the pages, because there are quite a lot of them and some are quite interesting. Let's see what we can look at on each page:

  • all the possible searches they offer
  • which results they offer
  • how they can be improved
  • their history: when they were created, for what purpose, and who created them
  • how they compare with similar sites: whether they are better or worse, whether they offer more or less, which tools they provide, etc.

We could use pages like this one:

Cluvi
