ARABIC CORPUS COMPILATION

The first task in Arabic lexicography is corpus compilation.

Arabic corpora can be acquired from the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA). The currently popular corpora consist mainly of Agence France-Presse (AFP) newswire text and the archives of the daily newspaper Al-Hayat. Corpora can also be obtained by contacting the original publishers directly: Al-Hayat, for example, has been known to publish an annual CD-ROM of archived texts.

Corpora can also be obtained by harvesting text directly from websites. The Google Directory lists hundreds of Arabic-language websites. Most websites store Arabic in the Windows Arabic code page (Windows-1256, also written cp1256); a few use UTF-8 and even fewer use ISO 8859-6. Quite a few websites store Arabic in PDF files, from which the Arabic text cannot easily be reconstituted.
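Because downloaded pages may arrive in any of these three encodings, it helps to normalize everything to Unicode before further processing. The following is a minimal sketch in Python, not a definitive method: the function names decode_arabic and count_arabic are hypothetical, and the heuristic (try strict UTF-8 first, then keep whichever of cp1256 or ISO 8859-6 yields more Arabic-block characters, preferring cp1256 on a tie) is an assumption that works best on longer passages. When a page declares its encoding, that declaration should normally take precedence.

```python
from typing import Tuple


def count_arabic(text: str) -> int:
    """Count characters that fall in the Unicode Arabic block (U+0600-U+06FF)."""
    return sum(1 for ch in text if "\u0600" <= ch <= "\u06ff")


def decode_arabic(raw: bytes) -> Tuple[str, str]:
    """Return (decoded_text, encoding_name) for a byte string of Arabic text.

    Heuristic only: strict UTF-8 is tried first; otherwise the bytes are
    decoded as both cp1256 and ISO 8859-6 and the result containing more
    Arabic characters is kept.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # cp1256 is listed first so that it wins a tie, since it is the more
    # common of the two legacy encodings on the Web.
    candidates = ("cp1256", "iso8859_6")
    decoded = [(raw.decode(enc, errors="replace"), enc) for enc in candidates]
    return max(decoded, key=lambda pair: count_arabic(pair[0]))
```

A caller would pass the raw bytes of a downloaded page, for example text, encoding = decode_arabic(page_bytes), and then store the text as UTF-8. Very short inputs can decode equally well under both legacy encodings, so the guess is more reliable on whole articles than on single words.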

The following is a partial listing of Arabic newspapers with sizable archives on the Web:

Al-Ahram (Cairo) http://www.ahram.org.eg
Al-Bayan (UAE) http://www.albayan.co.ae
Al-Dustur (Amman) http://www.addustour.com
Al-Hayat (London) http://www.alhayat.com
Al-Nahar (Beirut) http://www.annaharonline.com
Al-Raya (Qatar) http://www.raya.com
Al-Riyadh (Riyadh) http://www.alriyadh-np.com
Al-Safir (Beirut) http://www.assafir.com
Al-Sharq Al-Awsat (London) http://www.asharqalawsat.com
Al-Watan (Qatar) http://www.al-watan.com

Once a corpus is compiled, the next task is to assess its size in terms of types and tokens (see WORD FREQUENCY COUNTS).
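As a rough illustration of what a type and token count involves, the sketch below treats every unbroken run of Arabic letters and diacritics as one token, and every distinct token string as one type. Both the tokenization rule and the name corpus_size are assumptions made for this example; a real count would also need a policy for vocalized versus unvocalized spellings, which are counted here as different types.

```python
import re
from collections import Counter

# Assumed token definition: a run of Arabic letters and diacritics
# (U+0621 hamza through U+0652 sukun). Digits, punctuation, and Latin
# text are ignored.
TOKEN_RE = re.compile(r"[\u0621-\u0652]+")


def corpus_size(text):
    """Return (token_count, type_count) for a Unicode string."""
    tokens = TOKEN_RE.findall(text)
    types = Counter(tokens)  # maps each distinct token to its frequency
    return len(tokens), len(types)
```

For instance, corpus_size("كتاب قلم كتاب") finds three tokens but only two types, because one word occurs twice.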



Copyright 2002 QAMUS LLC