
karanpatil1989 · 10-06-2017, 06:56 AM

The parser assumes precisely the tokenization of Arabic used in the Penn Arabic Treebank (ATB). You must provide input to the parser that is tokenized in this way, or the resulting parses will be terrible. We do now have a software component for segmenting Arabic, but you have to download and run it first; it isn't included in the parser (see the end of this answer). The Arabic parser itself simply uses a whitespace tokenizer. As far as we are aware, ATB tokenization has only an extensional definition; it isn't written down anywhere. Segmentation is based on the morphological analyses generated by the Buckwalter analyzer. The segmentation can be characterized thus:

Almost all clitics are split off as separate words. This includes clitic pronouns, prepositions, and conjunctions. However, the clitic determiner (definite article) "Al" (ال) is not separated off. Inflectional and derivational morphology is not separated off.
[GALE ROSETTA: These separated off clitics are not overtly marked as proclitics/enclitics, although we do have a facility to strip off the '+' and '#' characters that the IBM segmenter uses to mark enclitics and proclitics, respectively. See the example below using the option -escaper edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper]
Parentheses are rendered -LRB- and -RRB-
Quotes are rendered as (ASCII) straight single and double quotes (' and "), not as curly quotes or LaTeX-style quotes (unlike the Penn English Treebank).
Dashes are represented with the ASCII hyphen character (U+002D).
Non-break space is not used.
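
The punctuation conventions in the list above can be sketched as a small pre-processing step. This is only an illustration, not part of the parser: the function name `normalize_atb_punctuation` and the exact set of characters mapped are my own assumptions, and it does not do any clitic segmentation (that still needs a real segmenter).

```python
# Illustrative sketch only; the names and the character list are assumptions,
# not something shipped with the Stanford parser.

# Single-character replacements, following the conventions listed above.
ATB_PUNCT_MAP = {
    "(": " -LRB- ",   # parentheses become separate -LRB- / -RRB- tokens
    ")": " -RRB- ",
    "\u2018": "'",    # curly single quotes -> ASCII straight quote
    "\u2019": "'",
    "\u201c": '"',    # curly double quotes -> ASCII straight quote
    "\u201d": '"',
    "\u2010": "-",    # Unicode hyphen -> ASCII hyphen (U+002D)
    "\u2013": "-",    # en dash -> ASCII hyphen
    "\u2014": "-",    # em dash -> ASCII hyphen
    "\u00a0": " ",    # non-break space -> ordinary space
}


def normalize_atb_punctuation(text):
    """Rewrite punctuation in the ATB style and collapse whitespace."""
    # LaTeX-style quotes are two characters long, so handle them first.
    text = text.replace("``", '"').replace("''", '"')
    text = text.translate(str.maketrans(ATB_PUNCT_MAP))
    # The parser splits on whitespace, so single spaces are enough.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_atb_punctuation("(abc)\u00a0\u201cx\u201d \u2014 y"))
```

The output of this step is a whitespace-separated string, which matches the whitespace tokenizer the parser uses. Clitics would still have to be split off beforehand.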

adarshs004 · 10-06-2017, 06:56 AM

Yes, I want code for segmenting Arabic words into letters.