
Cross-lingual plagiarism detection: how effective are translation-aware methods when students paraphrase across languages?


Abstract

Cross-lingual plagiarism, whereby students translate source material from one language into another and subsequently paraphrase it, presents significant challenges for academic integrity systems. This dissertation synthesises existing research to evaluate the effectiveness of translation-aware detection methods when confronted with varying degrees of paraphrasing across languages. The methodology employs a comprehensive literature synthesis, examining classical machine translation approaches, alignment-based methods, word embedding techniques, and contemporary transformer-based models. Findings reveal that whilst translation-aware systems demonstrate reasonable effectiveness for simple translations and light paraphrasing, their performance degrades substantially when students apply heavy, stylistically varied paraphrasing strategies. Machine translation combined with monolingual analysis achieves strong overall performance, yet struggles with short or heavily obfuscated fragments. Conversely, deep learning approaches utilising multilingual transformers such as mBERT and XLM-R achieve near-state-of-the-art accuracy on paraphrased cross-lingual pairs. Nevertheless, heavily reworded or tool-assisted paraphrasing remains problematic for all automated systems. The dissertation concludes that optimal detection requires hybrid approaches combining semantic models with human expert review, particularly in high-stakes academic contexts.

Introduction

Academic integrity constitutes a foundational pillar of higher education, ensuring that scholarly work maintains its value as a genuine representation of student learning and intellectual contribution. Plagiarism undermines this foundation, representing not merely an ethical violation but a fundamental threat to educational standards and the credibility of academic qualifications (Bretag and Mahmud, 2009). Whilst traditional plagiarism detection systems have achieved considerable success in identifying copied text within a single language, the globalised nature of contemporary education has introduced new complexities that challenge existing detection paradigms.

Cross-lingual plagiarism represents one such complexity that has gained prominence alongside increased student mobility, multilingual educational environments, and widespread access to machine translation technologies. This form of academic misconduct occurs when individuals translate source material from one language into another, thereby circumventing conventional plagiarism detection systems that typically operate within monolingual boundaries (Barrón-Cedeño, Gupta and Rosso, 2013). The proliferation of freely available online translation services has democratised access to translation capabilities, enabling students without formal linguistic training to translate foreign-language sources with relative ease.

The challenge intensifies considerably when students combine translation with paraphrasing strategies. Simple translation, whilst problematic, maintains lexical and syntactic patterns that sophisticated detection systems can identify through cross-lingual alignment techniques. However, when translated material undergoes additional paraphrasing—whether through manual rewording, synonym substitution, or automated paraphrasing tools—the resulting text diverges sufficiently from its source to evade many detection mechanisms (Akbari, 2021). This combination of translation and paraphrasing creates what researchers have termed “spin-translation,” a particularly challenging form of academic misconduct that exploits the limitations of current detection technologies.

The academic significance of this problem extends beyond individual instances of misconduct. Universities worldwide invest substantial resources in plagiarism detection infrastructure, yet these investments may prove inadequate against sophisticated cross-lingual obfuscation strategies. Furthermore, the integrity of academic qualifications depends upon accurate assessment of student work, which becomes compromised when plagiarism goes undetected (Prentice and Kinden, 2018). The practical implications are equally substantial, as employers and professional bodies rely upon academic credentials as indicators of competence and integrity.

This dissertation addresses these concerns by examining the current state of translation-aware cross-lingual plagiarism detection, with particular emphasis on how these systems perform when students employ paraphrasing strategies of varying intensity. The investigation synthesises findings from multiple research streams, including classical computational linguistics approaches, emerging deep learning methodologies, and empirical studies of detection effectiveness across diverse language pairs.

Aim and objectives

The primary aim of this dissertation is to evaluate the effectiveness of translation-aware cross-lingual plagiarism detection methods when confronted with student-style paraphrasing across languages, identifying both capabilities and limitations of current approaches whilst determining optimal detection strategies.

To achieve this aim, the following objectives guide the investigation:

1. To examine and categorise the principal methodological approaches employed in cross-lingual plagiarism detection, distinguishing between classical translation-based methods, alignment techniques, embedding approaches, and deep learning models.

2. To analyse the comparative effectiveness of different detection method families against varying degrees of paraphrasing intensity, from literal translation through to heavy obfuscation.

3. To identify the specific challenges posed by student-style paraphrasing behaviours, including tool-assisted paraphrasing and synonym substitution strategies.

4. To evaluate the trade-offs between precision and recall across different methodological approaches when applied to paraphrased cross-lingual text.

5. To synthesise evidence-based recommendations for educational institutions seeking to implement effective cross-lingual plagiarism detection systems that account for paraphrasing behaviours.

6. To identify gaps in current research and suggest directions for future investigation in this rapidly evolving field.

Methodology

This dissertation employs a systematic literature synthesis methodology to examine the effectiveness of translation-aware cross-lingual plagiarism detection methods. Literature synthesis represents an appropriate methodological choice when the research aim requires consolidation and critical analysis of existing empirical findings across a dispersed body of work (Snyder, 2019). Given that cross-lingual plagiarism detection represents an active research domain with contributions from computational linguistics, information retrieval, natural language processing, and educational technology, synthesis enables identification of convergent findings whilst highlighting areas of methodological or empirical disagreement.

The literature search strategy encompassed multiple academic databases, including IEEE Xplore, ACM Digital Library, Web of Science, and Scopus, reflecting the computational and interdisciplinary nature of the research domain. Search terms combined plagiarism-related vocabulary (cross-lingual plagiarism, translation plagiarism, multilingual plagiarism detection) with methodological terms (machine translation, word embeddings, transformers, semantic similarity) and outcome-related terms (paraphrase detection, obfuscation, accuracy, precision, recall).

Inclusion criteria required that sources address cross-lingual plagiarism detection explicitly, report empirical findings or systematic methodological analyses, and appear in peer-reviewed venues or recognised preprint repositories with subsequent citation validation. Sources were excluded if they addressed monolingual plagiarism exclusively, lacked methodological rigour, or failed to address paraphrasing as a variable affecting detection performance.

The synthesis process involved three analytical phases. First, categorisation organised identified methods into family groupings based upon their fundamental computational approaches: machine translation plus monolingual analysis, cross-lingual alignment, character-level methods, word and sentence embeddings, and deep transformer architectures. Second, comparative analysis examined reported performance metrics across these families, with particular attention to how paraphrasing intensity affected precision, recall, and overall accuracy. Third, critical evaluation assessed methodological limitations, generalisability constraints, and practical applicability for educational contexts.

Quality assessment of included sources considered factors including experimental design rigour, dataset representativeness, language pair diversity, and transparency of evaluation protocols. Where sources reported contradictory findings, synthesis prioritised more recent publications, larger-scale evaluations, and studies employing standardised benchmarks that facilitated cross-study comparison.

Literature review

### Evolution of cross-lingual plagiarism detection

Cross-lingual plagiarism detection emerged as a distinct research domain in response to increasing globalisation of academic practice and the widespread availability of machine translation services. Early approaches relied upon dictionary-based translation, converting texts to a common language before applying monolingual similarity measures (Potthast et al., 2011). These foundational systems demonstrated the feasibility of cross-lingual detection whilst simultaneously revealing the limitations of direct translation approaches when source material underwent modification.

The field subsequently evolved through several methodological generations. Character-based approaches, particularly cross-lingual character n-grams (CL-CNG), exploited orthographic similarities between related languages without requiring explicit translation resources. Statistical alignment methods drew upon techniques developed for machine translation itself, identifying corresponding text segments across language boundaries through probabilistic models. More recently, the emergence of distributed semantic representations and deep neural architectures has transformed the landscape, enabling detection approaches that capture meaning rather than merely surface form (Amirzhanov, Turan and Makhmutova, 2025).

### Machine translation plus monolingual analysis approaches

Machine translation followed by monolingual analysis (T+MA) represents one of the most thoroughly investigated approaches to cross-lingual plagiarism detection. This methodology translates source documents into the target language, then applies established monolingual plagiarism detection techniques to identify similarity. The intuitive appeal of this approach lies in its ability to leverage mature monolingual detection systems and continuously improving machine translation quality.
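The T+MA pipeline can be sketched in a few lines. The fragment below is a toy illustration only: the "translation" step is mocked with a tiny hand-made Spanish-English lexicon standing in for a real MT system, and the monolingual stage is reduced to word-overlap (Jaccard) similarity rather than a production plagiarism detector.

```python
# Toy sketch of the T+MA pipeline: translate the suspicious document into
# the source language (mocked here with a tiny Spanish->English lexicon),
# then apply an ordinary monolingual similarity measure.
# All data below is illustrative, not a real translation system.

TOY_LEXICON = {"el": "the", "gato": "cat", "negro": "black",
               "duerme": "sleeps", "en": "on", "sofa": "sofa"}

def mock_translate(text: str) -> str:
    """Word-by-word 'translation' standing in for a real MT system."""
    return " ".join(TOY_LEXICON.get(w, w) for w in text.lower().split())

def jaccard(a: str, b: str) -> float:
    """Monolingual word-overlap similarity applied after translation."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

source = "the black cat sleeps on the sofa"
suspicious = "el gato negro duerme en el sofa"
score = jaccard(mock_translate(suspicious), source)
```

The sketch also makes the method's paraphrase weakness visible: substituting synonyms into `suspicious` after translation would shrink the word overlap directly, which is precisely the degradation the literature reports.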

Empirical evaluations demonstrate that T+MA achieves strong overall performance across various language pairs and plagiarism scenarios. Barrón-Cedeño, Gupta and Rosso (2013) report that T+MA achieves top overall performance for cross-lingual plagiarism detection, including scenarios involving translations with subsequent manual paraphrasing. The method exhibits particular strength when dealing with complete document translations or substantial text segments where machine translation systems can establish adequate context for accurate translation.

However, T+MA performance exhibits notable degradation under specific conditions. Short text fragments present particular challenges, as machine translation quality typically suffers when limited context is available for disambiguation and phrase selection. More significantly for the present investigation, heavily paraphrased content proves problematic even when translation quality remains adequate. When students apply substantial rewording after translation, the resulting text may diverge sufficiently from what automated translation would produce that similarity measures fail to identify the relationship (Barrón-Cedeño et al., 2010).

The reliance upon machine translation quality introduces additional variability. Language pairs with well-resourced translation systems (such as English-French or English-German) typically support stronger detection performance than under-resourced pairs. Furthermore, the continuous evolution of machine translation systems implies that detection effectiveness may fluctuate as translation approaches change, creating potential reproducibility concerns for longitudinal comparisons.

### Cross-lingual alignment-based methods

Cross-lingual semantic analysis (CL-ASA) and related alignment-based methods approach detection by identifying corresponding semantic units across languages without requiring full translation. These methods construct cross-lingual representations using multilingual semantic resources such as Europarl, Wikipedia interlanguage links, or explicitly aligned parallel corpora. Text similarity is then computed within this shared semantic space.
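A minimal sketch of the alignment idea follows. Rather than translating, it scores a suspicious passage against a source passage by looking up bilingual translation probabilities directly; the probability table here is hand-made and illustrative, not a model estimated from parallel corpora as CL-ASA would use.

```python
# Alignment-style (CL-ASA-like) similarity sketch: sum bilingual translation
# probabilities between word pairs, with no full translation step.
# The probability table is an illustrative toy, not a trained model.

TRANS_PROB = {  # p(english_word | spanish_word), made-up values
    ("cat", "gato"): 0.9, ("black", "negro"): 0.8,
    ("sleeps", "duerme"): 0.7, ("sofa", "sofa"): 0.95,
}

def alignment_score(source_en: str, suspicious_es: str) -> float:
    """Average best translation probability per source-language word."""
    en_words = source_en.lower().split()
    es_words = suspicious_es.lower().split()
    total = 0.0
    for e in en_words:
        total += max((TRANS_PROB.get((e, s), 0.0) for s in es_words),
                     default=0.0)
    return total / len(en_words) if en_words else 0.0

score = alignment_score("black cat sleeps", "el gato negro duerme")
```

Because paraphrased words simply fall outside the probability table, unmatched vocabulary lowers the score rather than raising false alarms, which mirrors the high-precision, lower-recall profile reported for these methods.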

Alignment-based approaches demonstrate distinctive performance characteristics. Barrón-Cedeño, Gupta and Rosso (2013) report that CL-ASA maintains high precision exceeding 0.9 in certain experimental configurations, making it particularly valuable when false positive minimisation is paramount. The method exhibits robustness at the document level, successfully identifying cross-lingual relationships even when paraphrasing has altered surface features.

Nevertheless, alignment methods exhibit a characteristic precision-recall trade-off that becomes pronounced with intensified paraphrasing. Whilst correctly identifying plagiarised content with high confidence when detected, these methods miss proportionally more paraphrased cases compared to translation-based alternatives. The reliance upon pre-existing alignment resources also constrains applicability to language pairs with adequate coverage in multilingual semantic databases (Tlitova et al., 2020).

### Character n-gram approaches

Cross-lingual character n-gram methods (CL-CNG) represent a computationally efficient approach that exploits orthographic similarities between languages sharing alphabetic heritage or loanword vocabulary. By comparing character-level patterns rather than word-level semantics, these methods avoid the computational overhead of translation or semantic alignment.
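The mechanism can be illustrated with a short sketch: both texts are reduced to character trigram sets and compared directly, with no translation resource at all. The example strings are illustrative, contrasting a cognate-rich Spanish/Portuguese pair with an unrelated English sentence.

```python
# Sketch of cross-lingual character n-gram (CL-CNG) comparison: texts in
# related languages are reduced to character trigram sets and compared
# directly, with no translation step.

def char_ngrams(text: str, n: int = 3) -> set:
    """Lower-cased character n-grams, spaces normalised to underscores."""
    t = "_".join(text.lower().split())
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def cng_similarity(a: str, b: str, n: int = 3) -> float:
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Cognate-rich pair (Spanish/Portuguese) vs an unrelated sentence:
close = cng_similarity("la universidad nacional", "a universidade nacional")
far = cng_similarity("la universidad nacional", "the weather is cold")
```

The same sketch exposes the vulnerability discussed below: replacing a cognate with a synonym changes every trigram it contributed, so even light paraphrasing erodes the overlap signal.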

Research consistently demonstrates that CL-CNG effectiveness depends critically upon language pair relatedness. For closely related languages sharing substantial cognate vocabulary—such as Spanish-Portuguese or Czech-Slovak—character n-gram overlap provides meaningful similarity signals even across language boundaries. However, performance deteriorates substantially for distant language pairs lacking orthographic correspondence (Barrón-Cedeño, Gupta and Rosso, 2013).

Crucially for this investigation, CL-CNG methods prove particularly vulnerable to paraphrasing obfuscation. Since synonym substitution and phrasal restructuring alter the character sequences that form the basis of comparison, even moderate paraphrasing can disrupt the n-gram matches upon which detection depends. Barrón-Cedeño et al. (2010) demonstrate this limitation empirically, finding CL-CNG struggles with both distant language pairs and stronger paraphrasing within any language pair. These findings suggest that whilst CL-CNG may serve as a rapid initial screening mechanism, it cannot reliably detect paraphrased cross-lingual plagiarism.

### Word embedding and multilingual semantic models

The development of distributed word representations and multilingual embedding spaces has enabled detection approaches that capture semantic similarity rather than lexical correspondence. These methods project texts from different languages into shared embedding spaces where semantic relationships are preserved, enabling direct similarity computation across language boundaries.
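The projection idea can be sketched with toy vectors. The three-dimensional embeddings below are hand-made stand-ins for real multilingual embeddings (which would have hundreds of dimensions and be learned from corpora); documents are represented as mean vectors and compared by cosine similarity in the shared space.

```python
# Sketch of similarity in a shared cross-lingual embedding space: words from
# both languages map into one vector space (toy 3-d vectors below), documents
# become mean vectors, and similarity is the cosine between them.
import math

EMB = {  # illustrative vectors; translation equivalents sit close together
    "cat": (0.9, 0.1, 0.0), "gato": (0.88, 0.12, 0.01),
    "sleeps": (0.1, 0.9, 0.0), "duerme": (0.09, 0.91, 0.02),
    "economy": (0.0, 0.1, 0.9),
}

def doc_vector(words):
    """Mean of the known word vectors in a document."""
    vecs = [EMB[w] for w in words if w in EMB]
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(3))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

sim_related = cosine(doc_vector(["cat", "sleeps"]),
                     doc_vector(["gato", "duerme"]))
sim_unrelated = cosine(doc_vector(["cat", "sleeps"]),
                       doc_vector(["economy"]))
```

Because a well-chosen synonym lands near the original word in the space, the mean document vector moves only slightly under paraphrasing, which is the mechanism behind the recall advantage reported for embedding methods.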

Asghari et al. (2019) provide compelling evidence for the paraphrase-resilience of embedding-based approaches. Investigating English-Persian cross-lingual plagiarism detection with seven distinct obfuscation types, they find that word-embedding-based detection yields the best recall on heavily paraphrased passages. Whilst translation-based methods achieve higher precision, they miss substantially more paraphrased cases—a critical limitation when the detection objective includes identifying sophisticated obfuscation attempts.

Thompson and Bowerman (2017) reinforce these findings through simulated embedding approaches designed to mimic translation system behaviour. Their method outperforms T+MA baselines particularly when synonyms are employed after translation, demonstrating enhanced robustness to paraphrastic modifications. The embedding approach captures semantic equivalence that persists through surface-level rewording, addressing a fundamental limitation of methods dependent upon lexical matching.

Multilingual embeddings generated from cross-lingual training objectives extend these capabilities. By learning representations that align semantically equivalent content across languages during training, these systems acquire intrinsic cross-lingual transfer capabilities. However, embedding quality depends upon training corpus coverage for relevant languages, potentially limiting effectiveness for under-resourced language pairs.

### Deep transformer architectures

The advent of transformer-based neural architectures has precipitated substantial advances in cross-lingual plagiarism detection capabilities. Multilingual pretrained models including mBERT (multilingual BERT), XLM-RoBERTa, and mBART acquire cross-lingual representations through training on multilingual corpora, enabling sophisticated semantic similarity assessment across languages.

Empirical evaluations consistently demonstrate transformer superiority for paraphrased cross-lingual detection. Avetisyan et al. (2023) report that transformer-based methods achieve state-of-the-art cross-lingual alignment, with F1 scores approaching 99% on several European language pairs, maintaining effectiveness even with paraphrased and near-paraphrased content. Ter-Hovhannisyan and Avetisyan (2022) confirm these findings specifically for multilingual language models in plagiarism contexts, demonstrating robust performance across varied obfuscation strategies.

Bouaine and Benabbou (2024) extend the analysis to bidirectional and auto-regressive transformer variants, finding efficient cross-lingual plagiarism detection that maintains effectiveness across paraphrasing conditions. Their work emphasises the practical applicability of transformer approaches despite computational demands, suggesting that efficiency improvements have made deployment feasible.

The capacity of transformer models to capture rich semantic features underlies their paraphrase robustness. Alzahrani and Aljuaid (2020) demonstrate this through deep neural networks incorporating topic similarity, semantic role analysis, and integration with knowledge bases such as BabelNet and WordNet. Their system successfully distinguishes between literally translated, paraphrased, summarised, and independently authored cross-lingual pairs, achieving “encouraging” accuracy on Arabic-English comparisons—a language pair presenting substantial typological distance.

Despite impressive performance, transformer approaches exhibit notable limitations. Computational requirements exceed those of classical methods substantially, potentially constraining deployment in resource-limited educational contexts. Model performance depends upon language coverage during pretraining, with under-represented languages potentially receiving degraded detection capabilities. Furthermore, the opacity of neural representations complicates interpretability, potentially limiting utility in contexts requiring explanation of detection decisions.

### Tool-assisted paraphrasing and spin-translation

The proliferation of automated paraphrasing tools introduces additional complexity that warrants specific consideration. Spin-translation, whereby students employ paraphrasing tools to further obfuscate translated content, represents an increasingly common obfuscation strategy that challenges even sophisticated detection systems (Akbari, 2021).

Prentice and Kinden (2018) examine the intersection of paraphrasing tools, translation technologies, and plagiarism detection through empirical investigation. Their findings indicate that systems optimised for literal translation detection systematically under-detect content processed through additional obfuscation stages. This detection gap has practical significance, as students aware of institutional detection capabilities may deliberately employ tool chains designed to exploit these limitations.

The challenge extends beyond detection capability to pedagogical response. Tool-assisted obfuscation produces text that may superficially appear original whilst deriving entirely from unattributed sources. Distinguishing this output from genuine paraphrasing reflecting comprehension presents difficulties for both automated systems and human reviewers, complicating the determination of intent that often factors into academic misconduct proceedings.

Discussion

The synthesised evidence reveals a complex landscape in which translation-aware cross-lingual plagiarism detection methods demonstrate substantial capability whilst exhibiting characteristic limitations when confronted with paraphrasing behaviours typical of student misconduct. This discussion critically analyses these findings in relation to the stated objectives, examining implications for both theoretical understanding and practical implementation.

### Differential effectiveness across method families

The first two objectives sought to categorise principal methodological approaches and analyse their comparative effectiveness against varying paraphrasing intensities. The literature synthesis reveals clear differentiation between method families on this dimension. Classical approaches including T+MA and CL-ASA demonstrate strong performance on literal translations and light paraphrasing, achieving the high precision values that educational institutions require to avoid false accusations. However, their effectiveness diminishes progressively as paraphrasing intensity increases.

This pattern reflects the fundamental computational basis of these methods. Translation-based approaches depend upon correspondences between machine-translated output and submitted text, correspondences that paraphrasing disrupts by substituting alternative lexical choices and syntactic structures. Alignment-based methods similarly rely upon identifiable semantic correspondences that heavy paraphrasing obscures. The finding that CL-CNG performs poorly on both distant language pairs and strong paraphrasing follows logically from its dependence upon orthographic similarity—a feature absent in both scenarios.

Embedding and transformer-based methods demonstrate qualitatively different behaviour, maintaining higher recall on heavily paraphrased content albeit with increased false positive rates. This trade-off reflects the semantic abstraction inherent in distributed representations: by capturing meaning at levels abstracted from surface form, these methods remain sensitive to underlying semantic correspondence even when surface realisation diverges substantially. Asghari et al.’s (2019) finding that embedding methods achieve best recall on heavily paraphrased passages provides direct evidence for this mechanism.

The performance differential carries significant implications for detection system design. Where detection objectives prioritise comprehensive coverage including sophisticated obfuscation attempts, embedding and transformer approaches offer clear advantages. Conversely, where precision requirements dominate—as when detection outputs directly trigger misconduct proceedings—classical approaches may remain preferable despite lower coverage.

### Challenges of student-style paraphrasing

The third objective addressed specific challenges posed by student paraphrasing behaviours. The evidence indicates that student-style paraphrasing, characterised by variable quality, inconsistent application, and potential tool assistance, creates detection scenarios not fully addressed by methods optimised for consistent obfuscation patterns.

Students engaging in cross-lingual plagiarism typically lack the linguistic sophistication to paraphrase uniformly, resulting in documents containing both near-literal passages and heavily modified sections. This inconsistency actually presents detection opportunities, as unparaphrased segments may trigger detection even when paraphrased portions evade identification. However, it complicates threshold calibration, as similarity scores aggregate signals from heterogeneous text regions.

Tool-assisted paraphrasing introduces additional variability. Automated paraphrasing tools employ various strategies—synonym substitution, syntactic transformation, sentence splitting—that produce characteristic artefacts potentially distinguishable from natural writing. Akbari’s (2021) analysis of spin-translation suggests that these patterns might eventually become detectable signatures, though current systems rarely exploit this potential. The ongoing development of more sophisticated paraphrasing tools, potentially incorporating large language models, may further complicate detection by producing more natural-seeming output.

### Precision-recall trade-offs

The fourth objective examined precision-recall trade-offs across methodological approaches. The synthesis reveals a consistent pattern whereby methods achieving highest recall on paraphrased content exhibit elevated false positive rates, whilst high-precision methods miss greater proportions of paraphrased cases.

This trade-off has profound practical implications. Educational institutions must balance harms: false positive accusations cause distress and reputational damage to innocent students, whilst false negatives permit academic misconduct to go undetected and potentially uncorrected. The appropriate calibration depends upon institutional values, procedural safeguards, and the consequences attached to detection.
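The trade-off can be made concrete with a worked example. The counts below are illustrative placeholders for two hypothetical detectors run over the same collection, not figures from any cited study; they simply show how the standard precision, recall, and F1 definitions encode the balance of harms just described.

```python
# Worked example of the precision-recall trade-off, using illustrative
# counts for two hypothetical detectors evaluated on the same corpus.

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard definitions: tp/fp/fn = true/false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# High-precision classical detector: few false accusations, many misses.
p1, r1, f1_classical = precision_recall_f1(tp=55, fp=3, fn=45)

# High-recall embedding detector: broad coverage, more false alarms.
p2, r2, f1_embedding = precision_recall_f1(tp=90, fp=40, fn=10)
```

Under these illustrative counts the classical detector accuses almost no innocent students (precision near 0.95) but misses nearly half of the plagiarised pairs, whilst the embedding detector reverses the balance; which profile is preferable depends on the procedural safeguards that follow detection.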

One resolution involves multi-stage detection architectures combining high-recall initial screening with high-precision verification. Embedding or transformer methods might identify candidate passages requiring further investigation, followed by alignment-based analysis and human review to confirm genuine plagiarism. This approach leverages complementary strengths whilst managing respective limitations, though it increases complexity and resource requirements.
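The multi-stage architecture described above can be sketched as a simple filter chain. Everything here is a placeholder under stated assumptions: the thresholds are arbitrary, and the toy scores stand in for an embedding-based screener and an alignment-based verifier respectively.

```python
# Sketch of a two-stage detection pipeline: a permissive, high-recall
# screener flags candidate pairs, and only flagged pairs reach a strict,
# high-precision verifier (and, in practice, a human reviewer).
# Thresholds and scoring functions are illustrative placeholders.

SCREEN_THRESHOLD = 0.5   # permissive: favours recall
VERIFY_THRESHOLD = 0.8   # strict: favours precision

def detect(pairs, screen_score, verify_score):
    """Return pairs that pass both stages, queued for expert review."""
    candidates = [p for p in pairs if screen_score(p) >= SCREEN_THRESHOLD]
    return [p for p in candidates if verify_score(p) >= VERIFY_THRESHOLD]

# Toy (screen, verify) scores for three document pairs:
scores = {"pair_a": (0.9, 0.85), "pair_b": (0.6, 0.4), "pair_c": (0.3, 0.9)}
flagged = detect(scores, lambda p: scores[p][0], lambda p: scores[p][1])
```

Only `pair_a` survives both stages here: `pair_b` is screened in but fails verification, and `pair_c` never reaches the expensive second stage, which is where the cost savings of the cascade come from.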

### Synthesis of recommendations

The fifth objective required evidence-based recommendations for institutional implementation. The synthesis supports several specific recommendations aligned with current best practices in the literature.

First, institutions should not rely exclusively upon detection systems optimised for literal translation. The evidence consistently demonstrates that such systems under-detect paraphrased cross-lingual plagiarism, creating a false sense of security whilst missing sophisticated misconduct attempts. Detection infrastructure should incorporate methods demonstrating paraphrase robustness, particularly embedding or transformer-based approaches.

Second, multilingual transformer models represent the current state of the art and should be prioritised where computational resources permit. The remarkable accuracy achieved by mBERT, XLM-R, and related architectures on paraphrased content justifies the additional computational investment, particularly given declining computational costs and increasing model efficiency.

Third, human review remains essential for high-stakes determinations. No automated system achieves perfect accuracy on heavily paraphrased cross-lingual content, and the consequences of misconduct findings warrant human judgement. Automated systems serve most appropriately as screening tools that prioritise cases for expert review rather than definitive arbiters of misconduct.

Fourth, detection strategies should be combined. The complementary strengths of different approaches—precision from alignment methods, recall from embeddings, explainability from translation-based comparisons—suggest that hybrid systems offer advantages over any single method. Barrón-Cedeño, Gupta and Rosso (2013) explicitly advocate such combinations, and subsequent developments reinforce this recommendation.

### Research gaps and future directions

The sixth objective sought to identify research gaps warranting future investigation. Several significant gaps emerge from the synthesis.

Under-resourced language pairs receive inadequate attention relative to their practical importance. The majority of evaluated systems focus on European languages with substantial digital resources, whilst many language pairs relevant to international education lack equivalent coverage. Arabic-English studies by Alzahrani and Aljuaid (2020) represent valuable exceptions, but extension to further language pairs is needed.

Evaluation methodologies exhibit inconsistency that complicates cross-study comparison. Standardised benchmarks incorporating varied paraphrasing intensities and realistic student misconduct patterns would enable more rigorous method comparison than currently possible. The development of such benchmarks should involve collaboration between computational researchers and educational practitioners to ensure ecological validity.

The interaction between evolving machine translation systems and detection effectiveness warrants longitudinal investigation. As translation quality improves, the boundary between acceptable source use and plagiarism may shift, with implications for both detection systems and academic integrity policies. Dynamic calibration approaches that adapt to changing translation capabilities represent an unexplored research direction.

Finally, the emergence of large language models capable of sophisticated paraphrasing and content generation creates novel challenges that existing methods may not address adequately. Future research must anticipate how these technologies might be employed for academic misconduct and develop corresponding detection capabilities.

Conclusions

This dissertation has examined the effectiveness of translation-aware cross-lingual plagiarism detection methods when confronted with student-style paraphrasing, synthesising evidence from computational linguistics, natural language processing, and educational technology research. The investigation achieved its stated objectives through systematic analysis of methodological approaches, comparative evaluation of paraphrase handling capabilities, and synthesis of practical recommendations.

The evidence supports several definitive conclusions. Translation-aware cross-lingual detection systems demonstrate genuine effectiveness for simple translation and light paraphrasing, successfully identifying the majority of straightforward cross-lingual plagiarism attempts. This baseline capability represents substantial progress from earlier systems limited to monolingual detection.

However, heavy paraphrasing of the type students might employ when deliberately evading detection continues to challenge all automated approaches. Classical methods including machine translation plus monolingual analysis and alignment-based techniques, whilst achieving high precision, miss proportionally more heavily paraphrased cases. Character n-gram approaches fail almost entirely on distant language pairs or strong paraphrasing combinations.

Modern embedding and transformer-based approaches substantially improve detection of paraphrased translations, achieving near-state-of-the-art accuracy even on heavily modified content. These methods capture semantic similarity at a level abstracted from surface form, enabling identification of underlying correspondences despite lexical and syntactic divergence. The performance improvements justify prioritising these approaches in detection infrastructure development.
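In such systems, a multilingual encoder (e.g. mBERT, XLM-R, or LaBSE) maps sentences from different languages into a shared vector space, and detection reduces to a cosine-similarity comparison with a threshold. The sketch below assumes that step has already happened: the vectors are toy stand-ins for real encoder output, and the threshold is an illustrative placeholder, not an empirically tuned value.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def flag_cross_lingual(suspicious_vec, source_vecs, threshold=0.8):
    """Flag a suspicious sentence if any source sentence exceeds the threshold.

    In a real pipeline, the vectors would come from a multilingual encoder;
    here they are toy embeddings used purely to illustrate the comparison.
    """
    scores = [cosine(suspicious_vec, v) for v in source_vecs]
    best = max(scores, default=0.0)
    return best >= threshold, best

# A heavy paraphrase of a source sentence stays close in the shared space,
# even though its surface form (and language) differs entirely.
paraphrase = [0.9, 0.1, 0.4]
sources = [[0.85, 0.15, 0.45],   # the translated-and-paraphrased source
           [0.1, 0.9, 0.2]]      # an unrelated sentence
flagged, score = flag_cross_lingual(paraphrase, sources)
```

The comparison abstracts away from surface form entirely, which is precisely why these methods tolerate lexical and syntactic divergence that defeats n-gram matching.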

Nevertheless, detection of short, heavily reworded, or tool-assisted cross-lingual plagiarism remains imperfect across all evaluated methods. This limitation carries significant implications for academic integrity assurance, suggesting that automated detection cannot substitute for pedagogical approaches emphasising understanding over submission, appropriate source acknowledgement practices, and institutional cultures that discourage misconduct.

The synthesis supports practical recommendations including implementation of hybrid detection architectures combining complementary method strengths, prioritisation of transformer-based approaches where resources permit, and maintenance of human review for high-stakes determinations. These recommendations align with emerging best practices whilst acknowledging the resource constraints facing many institutions.
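One way to operationalise the hybrid recommendation is a triage step that combines a cheap lexical score with a semantic score and routes borderline or high-stakes cases to a human reviewer. The function below is a minimal sketch under assumed score names and placeholder thresholds; an institutional deployment would calibrate these empirically.

```python
def triage(lexical_score, semantic_score, high_stakes=False,
           lex_thresh=0.5, sem_thresh=0.8, review_band=0.1):
    """Route a suspicious document pair through a hybrid detection pipeline.

    Thresholds are illustrative placeholders, not empirically tuned values.
    Returns one of: 'clear', 'flag', 'human_review'.
    """
    suspected = lexical_score >= lex_thresh or semantic_score >= sem_thresh
    near_boundary = abs(semantic_score - sem_thresh) <= review_band

    # High-stakes determinations keep a human in the loop even when
    # the automated evidence looks conclusive.
    if suspected and high_stakes:
        return "human_review"
    # Borderline semantic evidence goes to expert judgement rather
    # than being auto-flagged or auto-cleared.
    if near_boundary:
        return "human_review"
    if suspected:
        return "flag"
    return "clear"
```

The design reflects the recommendation directly: complementary detectors supply the evidence, but the final determination in consequential cases rests with academic judgement rather than the system alone.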

Future research should address under-resourced language pairs, develop standardised evaluation benchmarks incorporating realistic paraphrasing patterns, investigate dynamic calibration to evolving translation capabilities, and anticipate challenges posed by emerging large language model technologies. The field remains active and consequential, with ongoing developments likely to reshape capabilities and challenges alike.

The broader significance of these findings extends to fundamental questions about academic integrity in an era of increasingly sophisticated language technologies. As translation and paraphrasing tools become more capable, the boundary between acceptable assistance and academic misconduct requires ongoing negotiation. Detection technologies form one component of this negotiation, but educational responses emphasising understanding, attribution, and intellectual honesty remain equally essential.

References

Akbari, A., 2021. Spinning-translation and the act of plagiarising: how to avoid and resist. *Journal of Further and Higher Education*, 45(1), pp.49-64. https://doi.org/10.1080/0309877x.2019.1709629

Alzahrani, S. and Aljuaid, H., 2020. Identifying cross-lingual plagiarism using rich semantic features and deep neural networks: A study on Arabic-English plagiarism cases. *Journal of King Saud University – Computer and Information Sciences*, 34(4), pp.1110-1123. https://doi.org/10.1016/j.jksuci.2020.04.009

Amirzhanov, A., Turan, C. and Makhmutova, A., 2025. Plagiarism types and detection methods: a systematic survey of algorithms in text analysis. *Frontiers in Computer Science*, 7. https://doi.org/10.3389/fcomp.2025.1504725

Asghari, H., Fatemi, O., Mohtaj, S., Faili, H. and Rosso, P., 2019. On the use of word embedding for cross language plagiarism detection. *Intelligent Data Analysis*, 23(3), pp.661-680. https://doi.org/10.3233/ida-183985

Avetisyan, K., Malajyan, A., Ghukasyan, T. and Avetisyan, A., 2023. A Simple and Effective Method of Cross-Lingual Plagiarism Detection. arXiv preprint arXiv:2304.01352. https://doi.org/10.48550/arxiv.2304.01352

Barrón-Cedeño, A., Gupta, P. and Rosso, P., 2013. Methods for cross-language plagiarism detection. *Knowledge-Based Systems*, 50, pp.211-217. https://doi.org/10.1016/j.knosys.2013.06.018

Barrón-Cedeño, A., Rosso, P., Agirre, E. and Labaka, G., 2010. Plagiarism Detection across Distant Language Pairs. *Proceedings of the 23rd International Conference on Computational Linguistics*, pp.37-45.

Bouaine, C. and Benabbou, F., 2024. Efficient cross-lingual plagiarism detection using bidirectional and auto-regressive transformers. *IAES International Journal of Artificial Intelligence*, 13(4), pp.4619-4629. https://doi.org/10.11591/ijai.v13.i4.pp4619-4629

Bretag, T. and Mahmud, S., 2009. A model for determining student plagiarism: Electronic detection and academic judgement. *Journal of University Teaching and Learning Practice*, 6(1), pp.49-60.

Potthast, M., Barrón-Cedeño, A., Stein, B. and Rosso, P., 2011. Cross-language plagiarism detection. *Language Resources and Evaluation*, 45(1), pp.45-62.

Prentice, F. and Kinden, C., 2018. Paraphrasing tools, language translation tools and plagiarism: an exploratory study. *International Journal for Educational Integrity*, 14(1), pp.1-16. https://doi.org/10.1007/s40979-018-0036-7

Snyder, H., 2019. Literature review as a research methodology: An overview and guidelines. *Journal of Business Research*, 104, pp.333-339.

Ter-Hovhannisyan, T. and Avetisyan, K., 2022. Transformer-Based Multilingual Language Models in Cross-Lingual Plagiarism Detection. *2022 Ivannikov Memorial Workshop (IVMEM)*, pp.72-80. https://doi.org/10.1109/ivmem57067.2022.9983968

Thompson, V. and Bowerman, C., 2017. Detecting Cross-Lingual Plagiarism Using Simulated Word Embeddings. arXiv preprint arXiv:1712.10190.

Tlitova, A., Toschev, A., Talanov, M. and Kurnosov, V., 2020. Meta-Analysis of Cross-Language Plagiarism and Self-Plagiarism Detection Methods for Russian-English Language Pair. *Frontiers in Computer Science*, 2. https://doi.org/10.3389/fcomp.2020.523053

To cite this work, please use the following reference:

UK Dissertations. 12 February 2026. Cross-lingual plagiarism detection: how effective are translation-aware methods when students paraphrase across languages?. [online]. Available from: https://www.ukdissertations.com/dissertation-examples/cross-lingual-plagiarism-detection-how-effective-are-translation-aware-methods-when-students-paraphrase-across-languages/ [Accessed 13 February 2026].
