
Advanced Hybrid Plagiarism & AI-Detection System (WACP)

An institutional-grade academic integrity platform that detects plagiarism and AI-generated content in student project submissions. The system is built on a dual-backend architecture: a CodeIgniter 4 PHP frontend handles authentication, file management, and reporting, while a high-performance FastAPI (Python) microservice runs all NLP and machine learning analysis.

Every submission passes through a 7-stage pipeline: text extraction with positional bounding-box data, preliminary-page stripping (cover, abstract, ToC, bibliography), plagiarism comparison using n-gram fingerprinting and semantic sentence embeddings, a 4-class AI-content classifier, section-level PDF highlighting via PDF.js, auditable report generation, and dual-audience email delivery (full report to admin, summary to student).
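The stages above can be sketched as an ordered handler chain. This is an illustrative sketch only (stage names and the handler-map shape are assumptions, not the production API; the real pipeline runs inside the FastAPI service):

```python
# The 7 pipeline stages, in execution order (names illustrative).
STAGES = [
    "extract_text",        # text + positional bounding-box data
    "strip_preliminary",   # cover / abstract / ToC / bibliography removal
    "plagiarism_check",    # n-gram fingerprints + semantic embeddings
    "ai_classify",         # 4-class AI-content verdict
    "highlight_pdf",       # bounding-box coordinates for the PDF.js overlay
    "build_report",        # auditable report generation
    "send_emails",         # full report to admin, summary to student
]

def run_pipeline(submission, handlers):
    """Apply each stage's handler in order.

    `handlers` maps a stage name to a callable that takes the accumulated
    state dict and returns the updated state.
    """
    state = dict(submission)
    for stage in STAGES:
        state = handlers[stage](state)
    return state
```

Modeling the pipeline as data (an ordered list of stage names) rather than hard-coded calls makes it easy to skip or stub stages, which matters for testing and for the bulk path.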

A critical design feature is reference isolation: submissions are only compared against verified reference documents (is_reference = TRUE) and previous versions of the same project are excluded via a project_group_id lineage field, eliminating the self-comparison and version-inflation false positives that plague most plagiarism tools. For high-volume use, the system uses Celery task queues with Redis, parallel processing across 8 workers, and Redis-cached fingerprints/embeddings — processing 100 files in 5–10 minutes instead of ~50 minutes. An admin service dashboard allows live management of FastAPI, Celery, and Redis from the browser.
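The isolation rule reduces to a two-condition filter over the document corpus. A minimal sketch, with document records modeled as plain dicts using the `is_reference` and `project_group_id` fields described above (the dict shape is an assumption; in production this is a database query):

```python
def comparison_corpus(documents, submission):
    """Select documents eligible for plagiarism comparison.

    Only verified reference documents are considered, and every document
    in the submitting project's own lineage is excluded, so a project can
    never be flagged against its own earlier versions.
    """
    return [
        d for d in documents
        if d["is_reference"]  # verified reference documents only
        and d["project_group_id"] != submission["project_group_id"]
    ]
```

Because the guarantee is structural (a filter on the corpus) rather than a similarity threshold, self-comparison false positives are impossible by construction.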



Key Features

Features built into this project:

N-gram fingerprint plagiarism detection (Rabin/Karp rolling hash)
Semantic similarity via sentence-transformer embeddings
4-class AI-content classifier: Human / Raw AI / Human-edited AI / AI-paraphrased
50–200 heuristic features: linguistic perplexity + burstiness + structural uniformity + rare-synonym overuse (QuillBot fingerprint)
Hybrid Decision Layer combining ML probabilities with heuristic confidence scores
Reference isolation — submissions compared only against is_reference=TRUE documents
Project lineage tracking (project_group_id) prevents version re-submission inflation
Preliminary-page stripping: cover / abstract / ToC / acknowledgements / bibliography removed before analysis
Section-level (150–300 word window) analysis for segment-by-segment scores
PDF highlighting with bounding-box coordinates rendered via PDF.js overlay
Celery task queue with 3 dedicated queues: analysis / bulk / precompute
Redis caching of fingerprints and embeddings for instant re-use
Parallel processing with ThreadPoolExecutor (up to 8 workers)
Bulk upload mode: 100+ files in 5–10 minutes with auto-retry
Real-time job status polling from the CI4 frontend
Admin service dashboard: start / stop / log FastAPI + Celery + Redis from browser
Dual-audience email notifications: full report to admin / summary to student
Subscription & credits system with admin grant and per-user transaction history
Multi-role access control: admin / instructor / user
Multi-language support: English and French (UI + document analysis)
AI model manager: retrain classifier on demand from the admin panel
Feature drift monitoring alerts when AI writing patterns shift
Support for PDF, DOCX, and TXT file formats
Webhook callback from FastAPI to CI4 on analysis completion
Institutional audit trail: all reports stored permanently
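The Rabin-Karp n-gram fingerprinting named at the top of the list can be illustrated with a stripped-down rolling hash. This is a simplified sketch, not the production implementation; the `n`, `base`, and modulus values are illustrative:

```python
def ngram_fingerprints(tokens, n=5, base=257, mod=(1 << 61) - 1):
    """Rolling-hash fingerprints of every n-gram of tokens (Rabin-Karp style).

    Each new fingerprint is derived from the previous one in O(1): drop the
    contribution of the outgoing token, shift, and add the incoming token.
    """
    if len(tokens) < n:
        return set()
    ids = [hash(t) % mod for t in tokens]
    high = pow(base, n - 1, mod)          # weight of the outgoing token
    h = 0
    for x in ids[:n]:                     # hash of the first n-gram
        h = (h * base + x) % mod
    fps = {h}
    for i in range(n, len(ids)):          # roll across the rest
        h = ((h - ids[i - n] * high) * base + ids[i]) % mod
        fps.add(h)
    return fps

def overlap_score(a, b):
    """Jaccard overlap between two fingerprint sets (0.0 .. 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Fingerprint sets are cheap to store and compare, which is what makes the Redis caching of fingerprints effective: a document's set is computed once and reused across every later comparison.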

Challenges & Solutions

Technical problems encountered during development and how each was resolved.

1. Building a reliable CI4 ↔ FastAPI contract was the first major challenge. The PHP frontend and Python backend had to stay in sync on file paths, project IDs, and webhook callbacks under concurrent load. I solved this with a strict internal API layer and a ServiceStartupFilter that auto-boots all backend services when the CI4 app first loads, eliminating the "backend not running" failure state.
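The contract can be pinned down as typed payload schemas that both sides agree on. A minimal stdlib sketch (field names and the status vocabulary here are illustrative assumptions, not the actual wire format):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AnalysisRequest:
    """Sent from CI4 to FastAPI when a submission is queued."""
    project_id: int
    project_group_id: str
    file_path: str
    callback_url: str   # CI4 webhook FastAPI calls on completion

@dataclass(frozen=True)
class AnalysisCallback:
    """Posted back from FastAPI to the CI4 webhook when analysis finishes."""
    project_id: int
    status: str              # "completed" | "failed"
    plagiarism_score: float
    ai_verdict: str          # one of the 4 classifier classes

def validate_callback(payload: dict) -> AnalysisCallback:
    """Reject malformed callbacks before they touch the CI4 database."""
    if payload.get("status") not in {"completed", "failed"}:
        raise ValueError(f"unknown status: {payload.get('status')!r}")
    return AnalysisCallback(**payload)
```

Validating at the boundary means a drifted or partially deployed backend fails loudly at the webhook instead of silently corrupting report state.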

2. The accuracy problem was harder. A pure n-gram fingerprint approach caught exact copies but missed paraphrased content entirely. Layering sentence-transformer semantic embeddings on top added paraphrase detection, but this introduced a new problem: the system was flagging re-submitted versions of the same project as plagiarised against themselves. I designed the reference isolation model (is_reference flag + project_group_id lineage) to eliminate self-comparison and version inflation as structural guarantees rather than threshold tuning.
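The semantic layer amounts to nearest-neighbour search over sentence embeddings. A hedged sketch, with `embed` standing in for a sentence-transformer encode call and an illustrative similarity threshold:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def paraphrase_hits(sub_sents, ref_sents, embed, threshold=0.82):
    """Flag submission sentences whose best match in the reference corpus
    exceeds the similarity threshold (threshold value illustrative)."""
    ref_vecs = [embed(s) for s in ref_sents]
    hits = []
    for sent in sub_sents:
        v = embed(sent)
        best = max((cosine(v, r) for r in ref_vecs), default=0.0)
        if best >= threshold:
            hits.append((sent, round(best, 3)))
    return hits
```

In production `embed` would be the sentence-transformer model's encode function; keeping it injectable is also what makes the Redis embedding cache straightforward to slot in front of it.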

3. AI-content detection required a completely separate heuristic pipeline. Detecting GPT/Claude output is straightforward — low perplexity and high uniformity are reliable signals — but detecting QuillBot-paraphrased AI text is much harder because the surface statistics change. I extracted 50–200 features covering rare synonym overuse, sentence-length uniformity, function-word ratios, and readability stability, feeding them into a Gradient Boosting 4-class classifier. All results are expressed probabilistically ("Likely AI-paraphrased, 71% confidence") to comply with institutional ethics requirements.
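A stdlib-only taste of the heuristic features: the real pipeline extracts 50–200 of them and feeds the vector to the Gradient Boosting classifier, whose per-class probabilities back the "71% confidence" phrasing. Feature names, the sentence splitter, and the tiny function-word list here are all illustrative:

```python
import statistics

# Illustrative subset; the production list is far larger.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that", "it"}

def stylometric_features(text):
    """Compute a small subset of the stylometric features described above."""
    sents = [s for s in text.replace("!", ".").replace("?", ".").split(".")
             if s.strip()]
    lengths = [len(s.split()) for s in sents]
    words = text.lower().split()
    return {
        "mean_sentence_len": statistics.mean(lengths) if lengths else 0.0,
        # Burstiness: human writing varies sentence length far more than
        # raw AI output, which tends toward structural uniformity.
        "burstiness": statistics.pstdev(lengths) if len(lengths) > 1 else 0.0,
        "function_word_ratio": (sum(w in FUNCTION_WORDS for w in words)
                                / len(words)) if words else 0.0,
        # Lexical diversity; paraphrasers shift this via rare-synonym swaps.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }
```

Vectors like these would then go to something like scikit-learn's GradientBoostingClassifier, with predict_proba supplying the calibrated per-class confidence.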

4. Performance collapsed under bulk submissions. Sequential processing took ~30 seconds per file, so 100 files meant 50 minutes and frequent timeouts. I rebuilt the analysis path around Celery with three dedicated queues, Redis caching for fingerprints and embeddings, and ThreadPoolExecutor for in-process parallelism — cutting 100-file batches to 5–10 minutes with automatic retry on failure.
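The in-process half of that rebuild can be sketched with ThreadPoolExecutor plus a simple retry loop. This is a simplified stand-in for the Celery-based bulk path, not the production code; worker and retry counts are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_batch(files, analyze, max_workers=8, retries=2):
    """Analyze files in parallel with auto-retry on transient failures.

    `analyze` is the per-file analysis callable; a file that still fails
    after all retries is recorded as failed instead of aborting the batch.
    """
    def attempt(path):
        for i in range(retries + 1):
            try:
                return path, analyze(path), None
            except Exception as exc:
                if i == retries:
                    return path, None, exc

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(attempt, f) for f in files]
        for fut in as_completed(futures):
            path, result, err = fut.result()
            results[path] = result if err is None else f"failed: {err}"
    return results
```

The same shape maps onto Celery: `attempt` becomes a task with `max_retries`, and the three dedicated queues (analysis / bulk / precompute) keep long bulk jobs from starving interactive single-file requests.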