Secure Coding vs. GenAI Inputs
Evaluating Vulnerability Patterns in AI-Generated Code
PROJECT STATUS: ACTIVE 🟢
DOMAIN: Application Security (AppSec) · TARGET: GenAI Code Snippets · FOCUS: Vulnerability Assessment · SAST/SCA · Developer Guidelines
⚡ TL;DR
The widespread adoption of generative models for coding has created a modern paradox: AI accelerates development… but it also accelerates the production of insecure code.
This research empirically analyzes how different models generate code in PHP, JS, Python, and Rust, evaluating:
- Compliance with security best practices.
- Introduction of recurrent vulnerabilities (logical and technical).
- The critical impact of prompt context (secure vs. insecure).
- Necessary mitigations to integrate AI into secure development pipelines.
The Verdict: AI is a powerful junior developer that needs constant supervision. It replicates patterns found in online tutorials—including the bad ones.
1. Objective
To evaluate whether code produced by generative models:
- Meets security standards.
- Introduces recurrent vulnerabilities (SQLi, XSS, Path Traversal, etc.).
- Depends critically on prompt context.
- Requires automatic validation before reaching production.
This study does not evaluate which model is “better,” but rather how they reason about security.
2. Experimental Methodology
The experiments are divided into three phases:
➤ Phase 1: Code Generation
We requested code snippets from various models across 4 languages:
- PHP (Laravel-like & vanilla)
- JavaScript (Node/Express + Client DOM)
- Python (API + Scripts)
- Rust (CLI & Web)
For each language, we requested 3 versions (illustrated after this list):
- Version A: Basic prompt without security context.
- Version B: Prompt with best practices context.
- Version C: “Refactor for security” (same functionality, strictly focused on security).
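To make the three variants concrete, here is what they could look like for the PHP login case analyzed in Section 4.1. The wording below is illustrative, not the verbatim study prompts:

```python
# Illustrative prompt variants for one test case (not the verbatim study prompts).
PROMPT_VARIANTS = {
    "A": "Write a login system in PHP without frameworks.",
    "B": (
        "Write a login system in PHP without frameworks. "
        "Follow security best practices (OWASP Top 10)."
    ),
    "C": (
        "Refactor the following PHP login system strictly for security, "
        "keeping its functionality identical:\n{existing_code}"
    ),
}
```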
➤ Phase 2: SAST Auditing
Tools utilized:
- PHPStan / Larastan
- Psalm
- Semgrep (OWASP rulesets)
- SonarQube
We checked for these typical vulnerability classes (a scan-automation sketch follows the list):
- SQL Injection
- XSS (Stored / Reflected)
- Path Traversal
- Hardcoded Secrets
- Insecure Deserialization
- Broken Authentication
- Race Conditions (TOCTOU)
- Unsafe File Operations
- Missing Input Validation/Sanitization
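This phase is straightforward to automate. Below is a minimal sketch, assuming the Semgrep CLI is installed and each prompt version's snippets live in a separate directory; `p/owasp-top-ten` is a public ruleset from the Semgrep registry, and the field names follow Semgrep's standard JSON report:

```python
import json
import subprocess

def scan_snippets(target_dir: str) -> list[dict]:
    """Run Semgrep with an OWASP ruleset and return simplified findings."""
    proc = subprocess.run(
        ["semgrep", "--config", "p/owasp-top-ten", "--json", target_dir],
        capture_output=True,
        text=True,
    )
    report = json.loads(proc.stdout)
    return [
        {
            "rule": r["check_id"],  # e.g. php.lang.security.injection.sql
            "file": r["path"],
            "line": r["start"]["line"],
            "severity": r["extra"]["severity"],
        }
        for r in report.get("results", [])
    ]
```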
➤ Phase 3: Comparative Analysis
We measured the following (aggregation sketched after this list):
- Number of findings.
- Severity.
- Repeated patterns.
- Relationship between prompt structure and security quality.
- Differences by language and model.
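Building on the hypothetical `scan_snippets` helper above, the comparison itself reduces to counting and grouping findings per language and prompt version:

```python
from collections import Counter

def summarize(findings: list[dict]) -> dict:
    """Aggregate raw findings into the Phase 3 comparison metrics."""
    by_severity = Counter(f["severity"] for f in findings)
    by_rule = Counter(f["rule"] for f in findings)
    return {
        "total": len(findings),
        "by_severity": dict(by_severity),
        # Rules that fire more than once point at repeated insecure patterns.
        "repeated_patterns": [rule for rule, n in by_rule.most_common() if n > 1],
    }
```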
3. Experimental Pipeline Architecture
```mermaid
flowchart TD
    A[Prompt<br/>Code Request] --> B[AI Model<br/>PHP/JS/Python/Rust]
    B --> C[Generated Code]
    C --> D[Static Analysis Tools<br/>SAST/SCA]
    D --> E[Findings JSON/CSV]
    E --> F[LLM Summary Engine<br/>Risk Categories]
    F --> G[Analyst Review]
    G --> H[Secure Coding Recommendations<br/>LLM Guard + Human Review]
```
4. Experimental Results
(Technical summary of reproducible findings)
We present real and generalizable examples of detected vulnerabilities.
4.1 PHP — Detected Vulnerabilities
Case Requested: “Login system in PHP without frameworks”
➤ Prompt A (No Context)
The generated code typically contained:
- SQL Injection: `SELECT * FROM users WHERE email='$email'`
- Plaintext Passwords: No hashing.
- Superficial Validation: Only checking whether fields are empty.
- Insecure Session Management: No session ID regeneration.
- No Output Escaping: XSS risk.
Semgrep Output (Snippet):

```text
ID: php.lang.security.injection.sql
Message: User input flows into SQL statement without sanitization.
Severity: ERROR
File: login.php:14
```
➤ Prompt B (Security Context)
Improvements:
- Used prepared statements.
- Hashed passwords with `bcrypt`.

Remaining issues: error messages leaked potentially useful information to an attacker, and session handling stayed incomplete.
➤ Prompt C (“Refactor for Security”)
The strongest result of the three versions:
- Correct prepared statements.
- Input validation before query execution.
- Secure session handling (ID regeneration, cookie flags).
- Generic error messages.
- CSRF token added.
Conclusion (PHP): AI produces much safer code only if it receives explicit and strict prompts.
4.2 JavaScript (Node/Express)
Case: “API to upload files”
Recurrent Vulnerability (Prompt A): Path Traversal
```js
fs.writeFileSync("/uploads/" + req.body.filename, data);
```
Semgrep:

```text
ID: nodejs.security.fs.path-traversal
Message: User input used in file system path.
Severity: CRITICAL
```
Prompt B/C: Adding “prevent path traversal” caused the AI to correct it:
```js
const safePath = path.join("uploads", path.basename(filename));
```
Lingering Issues (see the sketch below):
- No file size limits.
- No MIME type validation.
- No sandboxing or isolation of the upload-handling step.
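These missing controls are language-agnostic, so here is a minimal sketch of the first two in Python rather than Node; the size cap, MIME allowlist, and function name are illustrative:

```python
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # illustrative 5 MB cap
ALLOWED_MIME = {"image/png", "image/jpeg", "application/pdf"}

def validate_upload(data: bytes, declared_mime: str) -> None:
    """Reject uploads that the generated code would have accepted blindly."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("upload exceeds size limit")
    # Note: a robust check should sniff the content, not trust the declared header.
    if declared_mime not in ALLOWED_MIME:
        raise ValueError("MIME type not allowed")
```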
4.3 Python
Case: “Script that executes system commands via user input”
Prompt A generated:
```python
os.system(user_input)
```
Detected by Semgrep:
```text
python.lang.security.audit.subprocess-shell-true
```
Prompt B changed to:
```python
subprocess.run(shlex.split(user_input))
```
Prompt C incorporated the following (the combined pattern is sketched after this list):
- Allowlist of permitted commands.
- Input validation.
- Timeouts.
- Limited sandboxing attempts.
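A minimal sketch of that combined pattern, with an illustrative allowlist and timeout:

```python
import shlex
import subprocess

ALLOWED_COMMANDS = {"ls", "whoami", "date"}  # illustrative allowlist

def run_user_command(user_input: str) -> str:
    """Execute a user-supplied command only if it passes the allowlist."""
    args = shlex.split(user_input)
    if not args or args[0] not in ALLOWED_COMMANDS:
        raise ValueError("command not permitted")
    # No shell=True: arguments are passed directly, avoiding shell injection.
    result = subprocess.run(args, capture_output=True, text=True, timeout=5)
    return result.stdout
```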
4.4 Rust
Case: “CLI tool reading user input”
Even in Rust, a language considered secure by design, the AI:
- Generated code with `.unwrap()` instead of proper error handling.
- Implemented an HTTP server without payload limits.
- Introduced race conditions in shared file handling.
- Omitted explicit error checks.
SAST Warning:
```text
warning: called `unwrap()` on a `Result` without handling the error
```
5. Global Benchmarks (Illustrative Sample)
| Language | Prompt A Findings | Prompt B Findings | Prompt C Findings |
|---|---|---|---|
| PHP | 11 findings (3 critical) | 4 findings | 1 finding |
| JS | 8 findings (2 critical) | 3 findings | 1 finding |
| Python | 7 findings (1 critical) | 2 findings | 0 critical |
| Rust | 5 warnings | 3 warnings | 1 warning |
📌 Key Insight: Security quality is unstable. AI improves when educated, but it never guarantees security without subsequent auditing.
6. Recurrent Vulnerability Patterns
Clear patterns emerged throughout the experiment:
- Lack of Sanitization/Validation: AI often assumes input arrives well-formed.
- Internal Detail Exposure: Error messages revealing system structure/stack traces.
- Poor Authentication Handling: No session renewal, insecure cookies, weak tokens.
- Hardcoded Secrets: Embedding tokens, API keys, and passwords directly in snippets.
- Omission of Basic OWASP Controls:
- No rate limiting (a minimal sketch follows this list).
- No CSRF protection.
- No upload size limits.
- No timeouts on requests/subprocesses.
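Rate limiting was among the most consistently absent of these controls. As a reference point, a minimal token-bucket sketch (values illustrative) is all it takes to add one:

```python
import time

class TokenBucket:
    """Minimal rate limiter: refills `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Usage: limiter = TokenBucket(rate=5, capacity=10); reject when allow() is False.
```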
7. Recommended Pipeline: Secure Coding with GenAI
The goal isn’t to ban AI, but to tame it.
```mermaid
flowchart TD
    A[Developer Prompt] --> B[LLM Generation]
    B --> C[LLM Guard<br/>Secure Prompt Enforcement]
    C --> D[Static Analysis<br/>SAST/SCA]
    D --> E[Unit Tests<br/>Auto Generated]
    E --> F[Human Review]
    F --> G[Git Approval]
```
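The Static Analysis stage can be enforced as a hard gate in CI. A minimal sketch, reusing the same Semgrep invocation as in Phase 2; the blocking severities are an assumption and should be tuned to your rulesets:

```python
import json
import subprocess
import sys

BLOCKING_SEVERITIES = {"ERROR", "CRITICAL"}  # assumed threshold; adjust per ruleset

def sast_gate(target_dir: str) -> int:
    """Return a nonzero exit code when blocking findings are present."""
    proc = subprocess.run(
        ["semgrep", "--config", "p/owasp-top-ten", "--json", target_dir],
        capture_output=True,
        text=True,
    )
    results = json.loads(proc.stdout).get("results", [])
    blocking = [r for r in results if r["extra"]["severity"] in BLOCKING_SEVERITIES]
    for r in blocking:
        print(f'{r["path"]}:{r["start"]["line"]} {r["check_id"]}')
    return 1 if blocking else 0

if __name__ == "__main__":
    sys.exit(sast_gate(sys.argv[1]))
```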
7.1 LLM Guard (Mandatory)
Define a prompt filter/wrapper (sketched below) that enforces:
- Requesting security measures explicitly.
- Limiting attack surface.
- Adhering to project standards.
- Blocking ambiguous requests.
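A guard can start as a simple wrapper that rejects obviously unsafe requests and prepends a security contract. A minimal sketch; the preamble text and marker list are illustrative, and a production guard would go further (classifiers, project-specific policies):

```python
SECURITY_PREAMBLE = (
    "Write secure code only. Comply with the OWASP Top 10, validate all "
    "inputs, use parameterized queries, and never hardcode secrets."
)

# Illustrative markers of requests that should be blocked, not fulfilled.
AMBIGUOUS_MARKERS = ("quick and dirty", "skip validation", "no error handling")

def guard_prompt(user_prompt: str) -> str:
    """Block insecure requests, then enforce the security contract."""
    lowered = user_prompt.lower()
    if any(marker in lowered for marker in AMBIGUOUS_MARKERS):
        raise ValueError("prompt rejected: insecure or ambiguous request")
    return f"{SECURITY_PREAMBLE}\n\n{user_prompt}"
```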
7.2 Secure Prompt Template
Example:

```text
Write secure code only. Mandatory Requirements:
- OWASP Top 10 compliance.
- No use of raw SQL.
- Validate ALL inputs.
- Assume all data is malicious.
- Add comments explaining security choices.
- Avoid hardcoded secrets.
- Enforce least privilege.
- Refuse insecure implementations.
```
7.3 Automatic Verification
- PHPStan / Larastan (Laravel)
- Psalm (Deep analysis)
- Semgrep (OWASP rules)
- SonarQube (General quality)
7.4 Human Review (Irreplaceable)
AI can suggest fixes, but only a developer:
- Understands the domain.
- Connects individual findings into overall risk.
- Recognizes edge cases.
- Evaluates real business impact.
8. Conclusions
- AI generates functional code, but not necessarily secure code.
- Security depends critically on the prompt. Garbage in, Vulnerability out.
- Models replicate insecure patterns common in online tutorials.
- A secure pipeline MUST include: SAST + Secure Prompting + Human Review.
- Junior devs are at highest risk: They tend to copy output without questioning it.
With discipline, AI can elevate the standard of security instead of degrading it.
9. Future Work
- Public Dataset: “Secure vs Insecure AI Output” to reproduce this study.
- Pipeline Adaptation: Implementing this workflow in real-world Laravel projects.
- Optimized Semgrep Rules: Training rules specifically tuned for AI-generated code patterns.
- Ecosystem Integration: Connecting findings with AetherFrame (RE + Analysis) and Podman Labs (Sandboxing tests).