pbcool · 08-17-2017, 12:40 AM

PLAGIARISM AUTO-DETECTION

PLAGIARISM AUTO-DETECTION IN ARABIC SCRIPTS USING
STATEMENT-BASED FINGERPRINTS MATCHING AND
FUZZY-SET INFORMATION RETRIEVAL

SALHA MOHAMMED ALZAHRANI
A project report submitted in partial fulfilment of the
requirements for the award of the degree of
Master of Science (Computer Science)

ABSTRACT
Many plagiarism detection techniques and tools have been developed mainly
for English scripts. It has been found that different methods use different document
descriptors ranging from characters to document structure. There appears to be no
published research on Arabic plagiarism detection, although Arabic is the academic
language in Arab universities and schools. Therefore, in this study, two techniques
have been developed for Arabic: three least-frequent 4-grams fingerprints matching
and fuzzy-set IR, using a statement-based document representation. Two statements are
treated as similar if their fingerprints match under the first technique, or if the
degree of similarity computed by the second technique exceeds the threshold value.
The corpus used in this study has 100 documents collected from Arabic Wikipedia,
with 3763 statements and 54346 stemmed words in total after stop-word removal.
Another 15 query documents with 943 statements were constructed with different
degrees of plagiarism. Preprocessing operations, such as removing stop words and
stemming, were applied to the corpus collection and the query documents. The
resulting documents were stored in a database. In this study, preliminary
experiments were carried out using WCopyFind and a naïve algorithm; the results
were accurate but not optimal.
Thus, the three least-frequent 4-grams fingerprints matching and fuzzy-set IR
techniques were investigated further to handle more plagiarism practices
effectively, such as rewording, rephrasing and restructuring of statements. Our
results using both techniques with Arabic are as successful as with English, taking
into account that Arabic natural language processing is much more complex than
English. The main conclusion is that Arabic plagiarism can best be handled with
fuzzy-set IR, since it outperforms the three least-frequent 4-grams fingerprints
matching in terms of detecting similar, but not necessarily identical, statements.
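
The two statement-level tests described in the abstract can be sketched roughly as follows. This is only an illustration and not the code from the report. Several details are assumptions, since the abstract does not give them: the 4-grams are taken over characters with spaces removed, "least frequent" is measured against counts over the whole corpus, two fingerprints "match" when they are equal as sets, the fuzzy membership of a word in a statement is 1 minus the product of (1 - correlation) over the statement's words, and the statement similarity is the average membership in one direction only. The correlation table, the threshold of 0.65 and the placeholder tokens standing in for stemmed Arabic words are all hypothetical.

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Return all character n-grams of a statement (spaces removed)."""
    s = text.replace(" ", "")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def fingerprint(statement, corpus_counts, k=3, n=4):
    """Pick the k n-grams of the statement that are rarest in the corpus."""
    grams = set(char_ngrams(statement, n))
    # Sort by corpus frequency, then alphabetically so ties break deterministically.
    ranked = sorted(grams, key=lambda g: (corpus_counts[g], g))
    return frozenset(ranked[:k])

def membership(word, statement_words, corr):
    """Fuzzy membership of a word in a statement: 1 - prod(1 - corr(word, w))."""
    prod = 1.0
    for w in statement_words:
        prod *= 1.0 - corr(word, w)
    return 1.0 - prod

def fuzzy_similarity(query_words, source_words, corr):
    """Average membership of the query's words in the source statement."""
    if not query_words:
        return 0.0
    total = sum(membership(w, source_words, corr) for w in query_words)
    return total / len(query_words)

# Placeholder tokens standing in for stemmed, stop-word-free Arabic words.
corpus = ["ktb drs jdyd", "qra ktb qdym"]
counts = Counter(g for s in corpus for g in char_ngrams(s))

# Hypothetical word-to-word correlation table; identical words correlate fully.
CORRELATION = {frozenset(("drs", "qra")): 0.5}

def corr(a, b):
    if a == b:
        return 1.0
    return CORRELATION.get(frozenset((a, b)), 0.0)

THRESHOLD = 0.65  # hypothetical value, not taken from the report

query = ["qra", "ktb", "jdyd"]    # a reworded copy of the first statement
source = ["drs", "ktb", "jdyd"]

fingerprints_match = fingerprint(" ".join(query), counts) == fingerprint(corpus[0], counts)
similarity = fuzzy_similarity(query, source, corr)
is_similar = similarity > THRESHOLD
```

In this toy case the reworded statement no longer shares the same fingerprint with the original, while the fuzzy similarity stays above the assumed threshold, which is the kind of difference the abstract's conclusion refers to.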