Detecting Machine-Generated Code: Unveiling Patterns in AI-Generated Programming
Key Insights:
While previous detection methods such as DetectGPT have been successful at identifying machine-generated natural-language text, they struggle with code because of its strict syntax. Our research examines the distinct characteristics of human- and machine-authored code, analyzing aspects such as:
- Lexical Diversity: Machines use a narrower spectrum of tokens, whereas human-written code tends to be more diverse.
- Conciseness: Machines often produce concise code, while humans include more identifiers and comments.
- Naturalness: Surprisingly, machine-generated code can appear more "natural" than human code in certain scenarios, making it harder to detect using traditional methods.
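The lexical-diversity gap above can be illustrated with a simple type-token ratio. This is a minimal sketch for intuition only: the whitespace tokenizer and the two snippets are illustrative assumptions, not the paper's actual measurement pipeline, which would use a proper code tokenizer.

```python
def type_token_ratio(code: str) -> float:
    """Fraction of distinct tokens among all tokens (crude lexical diversity)."""
    tokens = code.split()  # simplification: real analyses use a code tokenizer
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical snippets: a human-style line vs. a repetitive machine-style loop.
human = "total_price = sum(item.price for item in cart)  # accumulate cart total"
machine = "s = 0\nfor i in a:\n    s = s + i"

print(type_token_ratio(human), type_token_ratio(machine))
```

On these toy snippets the repetitive loop reuses tokens like `s` and `=`, so its ratio comes out lower, mirroring the narrower token spectrum observed for machine-generated code.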
Introducing DetectCodeGPT:
Building on these insights, we developed DetectCodeGPT, a novel method that goes beyond existing perturbation-based approaches. It exploits code-specific patterns such as syntactic segmentation to distinguish machine-generated from human-authored code: by strategically perturbing the code's stylistic elements (such as spaces and newlines), it significantly improves detection accuracy while remaining computationally efficient.
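The core perturbation idea can be sketched as follows. This is an illustrative assumption of how stylistic perturbation might look, not the paper's exact implementation; the function name and sampling strategy are hypothetical.

```python
import random

def perturb_style(code: str, n_spaces: int = 2, n_newlines: int = 1,
                  seed: int = 0) -> str:
    """Insert extra spaces and newlines at random positions.

    A stylistic perturbation in the spirit of DetectCodeGPT: it changes
    whitespace (which models are sensitive to) without touching the
    code's tokens themselves.
    """
    rng = random.Random(seed)
    chars = list(code)
    for _ in range(n_spaces):
        chars.insert(rng.randrange(len(chars) + 1), " ")
    for _ in range(n_newlines):
        chars.insert(rng.randrange(len(chars) + 1), "\n")
    return "".join(chars)

print(repr(perturb_style("x = 1")))
```

In a full detector, the original code and several perturbed copies would each be scored under a code language model, and the drop in log-likelihood compared: machine-generated code tends to sit at a sharper likelihood peak, so stylistic perturbations hurt its score more than they hurt human-written code.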
Results:
Our experiments demonstrate that DetectCodeGPT outperforms state-of-the-art methods, improving detection AUC by 7.6%. Whether you're working on a software development team or researching AI and code generation, this tool could be a game-changer for maintaining code integrity and ensuring the authenticity of software artifacts.
To explore the full details of our research and try out DetectCodeGPT:
ICSE 2025 Paper: https://arxiv.org/html/2401.06461v2
Code: https://github.com/YerbaPage/DetectCodeGPT
Reference:
[1] Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers. Yuling Shi, Hongyu Zhang, Chengcheng Wan, Xiaodong Gu. In Proceedings of the 47th International Conference on Software Engineering (ICSE 2025).



