Building Hybrid Stop-Words Technique with Normalization for Pre-Processing Arabic Text


Jaffar Atwan


Vol. 22  No. 7  pp. 65-74


In natural language processing, commonly used words such as prepositions are referred to as stop-words; they have no inherent meaning and are therefore ignored in indexing and retrieval tasks. Removing stop-words from Arabic text significantly reduces the size of a text corpus, which improves the effectiveness and performance of Arabic-language processing systems. This study investigated the effectiveness of stop-word list elimination with normalization as a preprocessing step. The idea was to merge a statistical method with a linguistic method to attain the best efficacy, and to compare the effects of this two-pronged approach in reducing corpus size for Arabic natural language processing systems. Three stop-word lists were considered: an Arabic Text Lookup Stop-list, a Frequency-based Stop-list using Zipf’s law, and a Combined Stop-list. An experiment was conducted using a selected file from the Arabic Newswire data set. In the experiment, the size of the corpus was compared after removing the words contained in each list. The results showed that the best reduction in size was achieved by using the Combined Stop-list with normalization, with a word count reduction of 452,930 and a compression rate of 30%.
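The combined approach described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the lookup list below is a tiny hypothetical sample, the normalization rules (stripping diacritics and tatweel, unifying alef variants, mapping alef maqsura to ya) are common conventions assumed here, the frequency list simply takes the `top_k` most frequent tokens as a stand-in for a Zipf-based cutoff, and the compression rate is assumed to be the fraction of tokens removed.

```python
import re
from collections import Counter

# Hypothetical sample of a lookup stop-list; the paper's actual list is not reproduced here.
LOOKUP_STOPLIST = {"في", "من", "على", "إلى", "عن"}

# Assumed normalization rules (common conventions, not taken from the paper).
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")   # harakat (U+064B..U+0652) and tatweel (U+0640)
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below


def normalize(token):
    """Strip diacritics and tatweel, unify alef forms, map alef maqsura to ya."""
    token = _DIACRITICS.sub("", token)
    token = _ALEF_VARIANTS.sub("\u0627", token)   # bare alef
    return token.replace("\u0649", "\u064A")      # alef maqsura -> ya


def frequency_stoplist(tokens, top_k):
    """Statistical list: the top_k most frequent tokens (a simple stand-in for a Zipf cutoff)."""
    return {word for word, _ in Counter(tokens).most_common(top_k)}


def combined_stoplist(tokens, lookup, top_k):
    """Union of the normalized lookup list and the frequency-based list."""
    return {normalize(w) for w in lookup} | frequency_stoplist(tokens, top_k)


def compression_rate(before, after):
    """Fraction of tokens removed (assumed definition)."""
    return (before - after) / before if before else 0.0


def preprocess(text, lookup=LOOKUP_STOPLIST, top_k=0):
    """Normalize whitespace-separated tokens, drop stop-words, report the reduction."""
    tokens = [normalize(t) for t in text.split()]
    stoplist = combined_stoplist(tokens, lookup, top_k)
    kept = [t for t in tokens if t not in stoplist]
    return kept, compression_rate(len(tokens), len(kept))
```

Normalizing the lookup list together with the text matters in this sketch: once alef and ya variants are unified, an unnormalized list entry would no longer match its normalized form in the corpus.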


Arabic, Normalization, Preprocessing, StopWords, Zipf’s law

2012 ACM Computing Classification System: Computing methodologies, Artificial intelligence, Natural language processing