
Advanced Hybrid Plagiarism & AI-Detection System (WACP)

An institutional-grade academic integrity platform that detects plagiarism and AI-generated content in student project submissions. The system is built on a dual-backend architecture: a CodeIgniter 4 PHP frontend handles authentication, file management, and reporting, while a high-performance FastAPI (Python) microservice runs all NLP and machine learning analysis.

Every submission passes through a 7-stage pipeline: text extraction with positional bounding-box data, preliminary-page stripping (cover, abstract, ToC, bibliography), plagiarism comparison using n-gram fingerprinting and semantic sentence embeddings, a 4-class AI-content classifier, section-level PDF highlighting via PDF.js, auditable report generation, and dual-audience email delivery (full report to admin, summary to student).
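The stages above can be sketched as an ordered handler chain. This is an illustrative sketch only (stage names and the handler-map shape are assumptions, not the production API; the real pipeline runs inside the FastAPI service):

```python
# The 7 pipeline stages, in execution order (names illustrative).
STAGES = [
    "extract_text",        # text + positional bounding-box data
    "strip_preliminary",   # cover / abstract / ToC / bibliography removal
    "plagiarism_check",    # n-gram fingerprints + semantic embeddings
    "ai_classify",         # 4-class AI-content verdict
    "highlight_pdf",       # bounding-box coordinates for the PDF.js overlay
    "build_report",        # auditable report generation
    "send_emails",         # full report to admin, summary to student
]

def run_pipeline(submission, handlers):
    """Apply each stage's handler in order.

    `handlers` maps a stage name to a callable that takes the accumulated
    state dict and returns the updated state.
    """
    state = dict(submission)
    for stage in STAGES:
        state = handlers[stage](state)
    return state
```

Modeling the pipeline as data (an ordered list of stage names) rather than hard-coded calls makes it easy to skip or stub stages, which matters for testing and for the bulk path.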

A critical design feature is reference isolation: submissions are only compared against verified reference documents (is_reference = TRUE) and previous versions of the same project are excluded via a project_group_id lineage field, eliminating the self-comparison and version-inflation false positives that plague most plagiarism tools. For high-volume use, the system uses Celery task queues with Redis, parallel processing across 8 workers, and Redis-cached fingerprints/embeddings — processing 100 files in 5–10 minutes instead of ~50 minutes. An admin service dashboard allows live management of FastAPI, Celery, and Redis from the browser.
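The isolation rule reduces to a two-condition filter over the document corpus. A minimal sketch, with document records modeled as plain dicts using the `is_reference` and `project_group_id` fields described above (the dict shape is an assumption; in production this is a database query):

```python
def comparison_corpus(documents, submission):
    """Select documents eligible for plagiarism comparison.

    Only verified reference documents are considered, and every document
    in the submitting project's own lineage is excluded, so a project can
    never be flagged against its own earlier versions.
    """
    return [
        d for d in documents
        if d["is_reference"]  # verified reference documents only
        and d["project_group_id"] != submission["project_group_id"]
    ]
```

Because the guarantee is structural (a filter on the corpus) rather than a similarity threshold, self-comparison false positives are impossible by construction.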



Key Features

Features built into this project:

N-gram fingerprint plagiarism detection (Rabin/Karp rolling hash)
Semantic similarity via sentence-transformer embeddings
4-class AI-content classifier: Human / Raw AI / Human-edited AI / AI-paraphrased
50–200 heuristic features: linguistic perplexity + burstiness + structural uniformity + rare-synonym overuse (QuillBot fingerprint)
Hybrid Decision Layer combining ML probabilities with heuristic confidence scores
Reference isolation — submissions compared only against is_reference=TRUE documents
Project lineage tracking (project_group_id) prevents version re-submission inflation
Preliminary-page stripping: cover / abstract / ToC / acknowledgements / bibliography removed before analysis
Section-level (150–300 word window) analysis for segment-by-segment scores
PDF highlighting with bounding-box coordinates rendered via PDF.js overlay
Celery task queue with 3 dedicated queues: analysis / bulk / precompute
Redis caching of fingerprints and embeddings for instant re-use
Parallel processing with ThreadPoolExecutor (up to 8 workers)
Bulk upload mode: 100+ files in 5–10 minutes with auto-retry
Real-time job status polling from the CI4 frontend
Admin service dashboard: start / stop / log FastAPI + Celery + Redis from browser
Dual-audience email notifications: full report to admin / summary to student
Subscription & credits system with admin grant and per-user transaction history
Multi-role access control: admin / instructor / user
Multi-language support: English and French (UI + document analysis)
AI model manager: retrain classifier on demand from the admin panel
Feature drift monitoring alerts when AI writing patterns shift
Support for PDF, DOCX, and TXT file formats
Webhook callback from FastAPI to CI4 on analysis completion
Institutional audit trail: all reports stored permanently
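The Rabin-Karp n-gram fingerprinting named at the top of the list can be illustrated with a stripped-down rolling hash. This is a simplified sketch, not the production implementation; the `n`, `base`, and modulus values are illustrative:

```python
def ngram_fingerprints(tokens, n=5, base=257, mod=(1 << 61) - 1):
    """Rolling-hash fingerprints of every n-gram of tokens (Rabin-Karp style).

    Each new fingerprint is derived from the previous one in O(1): drop the
    contribution of the outgoing token, shift, and add the incoming token.
    """
    if len(tokens) < n:
        return set()
    ids = [hash(t) % mod for t in tokens]
    high = pow(base, n - 1, mod)          # weight of the outgoing token
    h = 0
    for x in ids[:n]:                     # hash of the first n-gram
        h = (h * base + x) % mod
    fps = {h}
    for i in range(n, len(ids)):          # roll across the rest
        h = ((h - ids[i - n] * high) * base + ids[i]) % mod
        fps.add(h)
    return fps

def overlap_score(a, b):
    """Jaccard overlap between two fingerprint sets (0.0 .. 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Fingerprint sets are cheap to store and compare, which is what makes the Redis caching of fingerprints effective: a document's set is computed once and reused across every later comparison.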

Challenges & Solutions

Technical problems encountered during development and how each was resolved.

1. Building a reliable CI4 ↔ FastAPI contract was the first major challenge. The PHP frontend and Python backend had to stay in sync on file paths, project IDs, and webhook callbacks under concurrent load. I solved this with a strict internal API layer and a ServiceStartupFilter that auto-boots all backend services when the CI4 app first loads, eliminating the "backend not running" failure state.
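The contract can be pinned down as typed payload schemas that both sides agree on. A minimal stdlib sketch (field names and the status vocabulary here are illustrative assumptions, not the actual wire format):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AnalysisRequest:
    """Sent from CI4 to FastAPI when a submission is queued."""
    project_id: int
    project_group_id: str
    file_path: str
    callback_url: str   # CI4 webhook FastAPI calls on completion

@dataclass(frozen=True)
class AnalysisCallback:
    """Posted back from FastAPI to the CI4 webhook when analysis finishes."""
    project_id: int
    status: str              # "completed" | "failed"
    plagiarism_score: float
    ai_verdict: str          # one of the 4 classifier classes

def validate_callback(payload: dict) -> AnalysisCallback:
    """Reject malformed callbacks before they touch the CI4 database."""
    if payload.get("status") not in {"completed", "failed"}:
        raise ValueError(f"unknown status: {payload.get('status')!r}")
    return AnalysisCallback(**payload)
```

Validating at the boundary means a drifted or partially deployed backend fails loudly at the webhook instead of silently corrupting report state.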

2. The accuracy problem was harder. A pure n-gram fingerprint approach caught exact copies but missed paraphrased content entirely. Layering sentence-transformer semantic embeddings on top added paraphrase detection, but this introduced a new problem: the system was flagging re-submitted versions of the same project as plagiarised against themselves. I designed the reference isolation model (is_reference flag + project_group_id lineage) to eliminate self-comparison and version inflation as structural guarantees rather than threshold tuning.
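The semantic layer amounts to nearest-neighbour search over sentence embeddings. A hedged sketch, with `embed` standing in for a sentence-transformer encode call and an illustrative similarity threshold:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def paraphrase_hits(sub_sents, ref_sents, embed, threshold=0.82):
    """Flag submission sentences whose best match in the reference corpus
    exceeds the similarity threshold (threshold value illustrative)."""
    ref_vecs = [embed(s) for s in ref_sents]
    hits = []
    for sent in sub_sents:
        v = embed(sent)
        best = max((cosine(v, r) for r in ref_vecs), default=0.0)
        if best >= threshold:
            hits.append((sent, round(best, 3)))
    return hits
```

In production `embed` would be the sentence-transformer model's encode function; keeping it injectable is also what makes the Redis embedding cache straightforward to slot in front of it.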

3. AI-content detection required a completely separate heuristic pipeline. Detecting GPT/Claude output is straightforward — low perplexity and high uniformity are reliable signals — but detecting QuillBot-paraphrased AI text is much harder because the surface statistics change. I extracted 50–200 features covering rare synonym overuse, sentence-length uniformity, function-word ratios, and readability stability, feeding them into a Gradient Boosting 4-class classifier. All results are expressed probabilistically ("Likely AI-paraphrased, 71% confidence") to comply with institutional ethics requirements.
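A stdlib-only taste of the heuristic features: the real pipeline extracts 50–200 of them and feeds the vector to the Gradient Boosting classifier, whose per-class probabilities back the "71% confidence" phrasing. Feature names, the sentence splitter, and the tiny function-word list here are all illustrative:

```python
import statistics

# Illustrative subset; the production list is far larger.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that", "it"}

def stylometric_features(text):
    """Compute a small subset of the stylometric features described above."""
    sents = [s for s in text.replace("!", ".").replace("?", ".").split(".")
             if s.strip()]
    lengths = [len(s.split()) for s in sents]
    words = text.lower().split()
    return {
        "mean_sentence_len": statistics.mean(lengths) if lengths else 0.0,
        # Burstiness: human writing varies sentence length far more than
        # raw AI output, which tends toward structural uniformity.
        "burstiness": statistics.pstdev(lengths) if len(lengths) > 1 else 0.0,
        "function_word_ratio": (sum(w in FUNCTION_WORDS for w in words)
                                / len(words)) if words else 0.0,
        # Lexical diversity; paraphrasers shift this via rare-synonym swaps.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }
```

Vectors like these would then go to something like scikit-learn's GradientBoostingClassifier, with predict_proba supplying the calibrated per-class confidence.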

4. Performance collapsed under bulk submissions. Sequential processing took ~30 seconds per file, so 100 files meant 50 minutes and frequent timeouts. I rebuilt the analysis path around Celery with three dedicated queues, Redis caching for fingerprints and embeddings, and ThreadPoolExecutor for in-process parallelism — cutting 100-file batches to 5–10 minutes with automatic retry on failure.
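The in-process half of that rebuild can be sketched with ThreadPoolExecutor plus a simple retry loop. This is a simplified stand-in for the Celery-based bulk path, not the production code; worker and retry counts are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_batch(files, analyze, max_workers=8, retries=2):
    """Analyze files in parallel with auto-retry on transient failures.

    `analyze` is the per-file analysis callable; a file that still fails
    after all retries is recorded as failed instead of aborting the batch.
    """
    def attempt(path):
        for i in range(retries + 1):
            try:
                return path, analyze(path), None
            except Exception as exc:
                if i == retries:
                    return path, None, exc

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(attempt, f) for f in files]
        for fut in as_completed(futures):
            path, result, err = fut.result()
            results[path] = result if err is None else f"failed: {err}"
    return results
```

The same shape maps onto Celery: `attempt` becomes a task with `max_retries`, and the three dedicated queues (analysis / bulk / precompute) keep long bulk jobs from starving interactive single-file requests.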