Large Language Models for Code Generation: The Practitioners Perspective
TL;DR Highlight
Real-world report card: 60 active developers coded with 8 LLMs and rated them.
Who Should Read
Dev team leads and senior developers considering AI coding assistant adoption, especially those who must choose among GPT-4o, Llama, and Mixtral based on team budget and use case.
Core Mechanics
- GPT-4o dominated at #1: 31 of 60 developers (52%) selected it as best, praised for superior Python code quality, library selection, and edge-case handling
- Llama 3.2 3B Instruct came 2nd (28%): clean code for simple tasks, fast response, cost-effective — a realistic alternative for small teams
- Mixtral 8x7B Instruct (12%) specialized in data analysis and ML workflows. High consistency on repetitive tasks
- GPT-3.5 Turbo ranked last: used outdated methods, misinterpreted complex requirements, and debugging actually took more time
- Existing benchmarks (HumanEval, MBPP, etc.) are synthetic and don't capture real-world complexity — gap exists between practitioner ratings and benchmark scores
- 92% of developers rated the multi-model integration platform as 'easy to use' — the comparison environment itself is an important tool
Evidence
- GPT-4o: 31/60 developers (52%) selected as best; 11 mentioned 'follows Python syntax and best practices'
- Llama 3.2 3B Instruct: selected by 17 developers (28%); response times 'under 1 second', noticeably faster than the other models
- System usability: 53/60 (92%) rated 'easy to use', 92% said 'met expectations'
- GPT-3.5 Turbo last: multiple practitioners said 'writing it myself would be faster'; repeatedly failed to use modern libraries
How to Apply
- For complex full-stack work (backend API + frontend components) or Python ML code, default to GPT-4o when budget allows.
- For simple script automation, PoC, or budget-constrained small teams, try Llama 3.2 3B Instruct first — faster response speed makes a noticeable difference on repetitive tasks.
- For repeatedly generating data pipeline or ML workflow code, Mixtral 8x7B Instruct has better consistency — can save cost vs GPT-4o for domain-specific tasks.
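The guidance above can be sketched as a simple keyword-based router that maps a task description to one of the three recommended models. The keyword heuristics here are illustrative assumptions, not something the study specifies:

```python
# Hypothetical task-to-model router following the survey's recommendations.
# The keyword lists are illustrative assumptions, not part of the study.
MODEL_MAP = {
    "gpt4o": "openai/gpt-4o",
    "llama": "meta-llama/llama-3.2-3b-instruct",
    "mixtral": "mistralai/mixtral-8x7b-instruct",
}

def pick_model(task_description: str) -> str:
    """Return an OpenRouter model id suited to the task, per the survey's findings."""
    task = task_description.lower()
    # Repetitive data pipeline / ML workflow code: Mixtral rated most consistent
    if any(kw in task for kw in ("pipeline", "etl", "ml workflow", "data analysis")):
        return MODEL_MAP["mixtral"]
    # Simple scripts and automation: Llama 3.2 3B is fast and cost-effective
    if any(kw in task for kw in ("script", "automation", "cron")):
        return MODEL_MAP["llama"]
    # Complex full-stack or Python ML work: default to GPT-4o when budget allows
    return MODEL_MAP["gpt4o"]
```

In practice a team would tune the categories to its own workload; the point is that the study's rankings translate directly into a cheap routing policy.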
Code Example
# Example configuration of a multi-model code generation platform using the OpenRouter API
import os

import requests

# Read the key from the environment rather than hardcoding it
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "your-api-key")

MODEL_MAP = {
    "gpt4o": "openai/gpt-4o",
    "llama": "meta-llama/llama-3.2-3b-instruct",
    "mixtral": "mistralai/mixtral-8x7b-instruct",
}

def generate_code(task_description: str, model_key: str = "gpt4o") -> str:
    """Send a natural-language task to the selected model and return its reply."""
    model = MODEL_MAP.get(model_key, MODEL_MAP["gpt4o"])
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": "You are an expert software engineer. Generate clean, production-ready code."},
                {"role": "user", "content": task_description},
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# Usage example
code = generate_code("Create a CRUD API with JWT authentication using FastAPI", model_key="gpt4o")
print(code)
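Any single provider can fail or rate-limit, so a multi-model setup like the one above is often paired with a fallback wrapper that tries models in order of preference. This is a common deployment pattern sketched as an assumption, not something the study prescribes; `generate_with_fallback` is a hypothetical helper name:

```python
from typing import Callable, Sequence

def generate_with_fallback(
    generate: Callable[[str, str], str],
    task_description: str,
    model_keys: Sequence[str] = ("gpt4o", "mixtral", "llama"),
) -> str:
    """Try each model key in order; return the first successful result."""
    last_error = None
    for key in model_keys:
        try:
            return generate(task_description, key)
        except Exception as exc:  # network errors, rate limits, bad responses
            last_error = exc
    raise RuntimeError(f"All models failed; last error: {last_error}")
```

Passing the `generate_code` function above as the `generate` argument keeps the API call and the retry policy decoupled, so the preference order can mirror whichever ranking the team adopts.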
Original Abstract
Large Language Models (LLMs) have emerged as coding assistants, capable of generating source code from natural language prompts. With the increasing adoption of LLMs in software development, academic research and industry-based projects are developing various tools, benchmarks, and metrics to evaluate the effectiveness of LLM-generated code. However, there is a lack of solutions evaluated through empirically grounded methods that incorporate practitioners' perspectives to assess functionality, syntax, and accuracy in real-world applications. To address this gap, we propose and develop a multi-model unified platform to generate and execute code based on natural language prompts. We conducted a survey with 60 software practitioners from 11 countries across four continents, working in diverse professional roles and domains, to evaluate the usability, performance, strengths, and limitations of each model. The results present practitioners' feedback and insights into the use of LLMs in software development, including their strengths and weaknesses, key aspects overlooked by benchmarks and metrics, and a broader understanding of their practical applicability. These findings can help researchers and practitioners make informed decisions for systematically selecting and using LLMs in software development projects. Future research will focus on integrating more diverse models into the proposed system, incorporating additional case studies, and conducting developer interviews for deeper empirical insights into LLM-driven software development.