Large Language Models for Code Generation: A Comprehensive Evaluation
Your Name, Lisa Wang, James Thompson
arXiv preprint
Keywords
Abstract
Large language models (LLMs) have shown remarkable capabilities in code generation, but comprehensive evaluation across diverse programming tasks and languages remains limited. Existing benchmarks often focus on simple algorithmic problems and may not reflect real-world software development challenges.In this work, we provide a comprehensive evaluation of state-of-the-art LLMs for automated code generation across multiple programming languages, domains, and complexity levels. We introduce CodeEval-Pro, a new benchmark suite that includes realistic software development tasks, and propose novel evaluation metrics that consider code quality, efficiency, and maintainability.Our evaluation covers 12 LLMs across 8 programming languages and 6 domains, providing insights into model capabilities, limitations, and best practices for code generation applications.
Methodology
We developed CodeEval-Pro, a comprehensive benchmark suite consisting of 2,000 programming tasks across 8 languages (Python, Java, C++, JavaScript, Go, Rust, Swift, Kotlin) and 6 domains (algorithms, data structures, web development, machine learning, systems programming, mobile development).
Our evaluation methodology includes: (1) Functional Correctness using automated testing, (2) Code Quality assessment using static analysis tools, (3) Efficiency Analysis measuring time and space complexity, (4) Maintainability Metrics including readability and documentation, and (5) Security Analysis identifying potential vulnerabilities.
We evaluated 12 LLMs including GPT-4, Claude, Gemini, CodeT5, and specialized code models. Each model generated solutions for all benchmark tasks, and we applied our comprehensive evaluation framework to assess performance across multiple dimensions.
Results
Key findings from our evaluation: (1) GPT-4 achieved the highest overall performance with 78.3% functional correctness, (2) Specialized code models outperformed general LLMs on domain-specific tasks, (3) Performance varied significantly across programming languages, with Python showing the best results (82.1% correctness) and Rust the lowest (61.4%), (4) Code quality metrics revealed that LLMs often generate functionally correct but poorly structured code, (5) Security analysis identified vulnerabilities in 23% of generated code samples.
Domain-specific results showed that LLMs performed best on algorithmic tasks (85.2% correctness) and struggled with systems programming (58.7% correctness). The evaluation revealed significant gaps between functional correctness and production-ready code quality.
Conclusion
Our comprehensive evaluation reveals both the potential and limitations of current LLMs for code generation. While models achieve impressive functional correctness on many tasks, significant challenges remain in generating high-quality, maintainable, and secure code. The CodeEval-Pro benchmark and evaluation framework provide a foundation for future research and development in automated code generation.
Publication Details
Citation
Your Name, Lisa Wang, James Thompson. "Large Language Models for Code Generation: A Comprehensive Evaluation." arXiv preprint. 2024.