Initializing Livey CyberDeck...
Loading intelligence modules...
Open Intelligence status: ONLINE

Secure Coding vs. GenAI Inputs

Evaluating Vulnerability Patterns in AI-Generated Code


PROJECT STATUS: ACTIVE 🟢 DOMAIN: Application Security (AppSec) · TARGET: GenAI Code Snippets FOCUS: Vulnerability Assessment · SAST/SCA · Developer Guidelines Security AI

⚡ TL;DR

The massive adoption of generative models for coding has created a modern paradox: AI accelerates development… but it also accelerates the production of insecure code.

This research empirically analyzes how different models generate code in PHP, JS, Python, and Rust, evaluating:

The Verdict: AI is a powerful junior developer that needs constant supervision. It replicates patterns found in online tutorials—including the bad ones.


1. Objective

To evaluate whether code produced by generative models:

  1. Meets security standards.
  2. Introduces recurrent vulnerabilities (SQLi, XSS, Path Traversal, etc.).
  3. Depends critically on prompt context.
  4. Requires automatic validation before reaching production.

This study does not evaluate which model is “better,” but rather how they reason about security.


2. Experimental Methodology

The experiments are divided into three phases:

➤ Phase 1: Code Generation

We requested code snippets from various models across 4 languages:

For each language, we requested 3 versions:

➤ Phase 2: SAST Auditing

Tools utilized:

We validated typical vulnerabilities:

➤ Phase 3: Comparative Analysis

We measured:


3. Experimental Pipeline Architecture

flowchart TD
    A[Prompt<br/>Code Request] --> B[AI Model<br/>PHP/JS/Python/Rust]
    B --> C[Generated Code]
    C --> D[Static Analysis Tools<br/>SAST/SCA]
    D --> E[Findings JSON/CSV]
    E --> F[LLM Summary Engine<br/>Risk Categories]
    F --> G[Analyst Review]

    G --> H[Secure Coding Recommendations<br/>LLM Guard + Human Review]

4. Experimental Results

(Technical summary of reproducible findings)

We present real and generalizable examples of detected vulnerabilities.

4.1 PHP — Detected Vulnerabilities

Case Requested: “Login system in PHP without frameworks”

➤ Prompt A (No Context) The generated code typically contained:

Semgrep Output (Snippet):

ID: php.lang.security.injection.sql
Message: User input flows into SQL statement without sanitization.
Severity: ERROR
File: login.php:14

➤ Prompt B (Security Context) Improvements:

➤ Prompt C (“Refactor for Security”) Comparative Excellence:

Conclusion (PHP): AI produces much safer code only if it receives explicit and strict prompts.

4.2 JavaScript (Node/Express)

Case: “API to upload files”

Recurrent Vulnerability (Prompt A): Path Traversal

fs.writeFileSync("/uploads/" + req.body.filename, data);

Semgrep:

ID: nodejs.security.fs.path-traversal
Message: User input used in file system path.
Severity: CRITICAL

Prompt B/C: Adding “prevent path traversal” caused the AI to correct it:

const safePath = path.join("uploads", path.basename(filename));

Lingering Issues:

4.3 Python

Case: “Script that executes system commands via user input”

Prompt A generated:

os.system(user_input)

Detected by Semgrep: python.lang.security.audit.subprocess-shell-true

Prompt B changed to:

subprocess.run(shlex.split(user_input))

Prompt C incorporated:

4.4 Rust

Case: “CLI tool reading user input”

Even in Rust—a language considered secure by design—the AI:

SAST Warning: warning: called unwrap() on a Result without handling the error


5. Global Benchmarks (Illustrative Sample)

Language Prompt A Findings Prompt B Findings Prompt C Findings
PHP 11 findings (3 critical) 4 findings 1 finding
JS 8 findings (2 critical) 3 findings 1 finding
Python 7 findings (1 critical) 2 findings 0 critical
Rust 5 warnings 3 warnings 1 warning

📌 Key Insight: Security quality is unstable. AI improves when educated, but it never guarantees security without subsequent auditing.


6. Recurrent Vulnerability Patterns

Clear patterns emerged throughout the experiment:

  1. Lack of Sanitization/Validation: AI often assumes input “comes well-formed.”
  2. Internal Detail Exposure: Error messages revealing system structure/stack traces.
  3. Poor Authentication Handling: No session renewal, insecure cookies, weak tokens.
  4. Hardcoded Secrets: Embedding tokens, API keys, and passwords directly in snippets.
  5. Omission of Basic OWASP Controls:
    • No rate limiting.
    • No CSRF protection.
    • No upload size limits.
    • No timeouts on requests/subprocesses.

The goal isn’t to ban AI, but to tame it.

flowchart TD
    A[Developer Prompt] --> B[LLM Generation]
    B --> C[LLM Guard<br/>Secure Prompt Enforcement]
    C --> D[Static Analysis<br/>SAST/SCA]
    D --> E[Unit Tests<br/>Auto Generated]
    E --> F[Human Review]
    F --> G[Git Approval]

7.1 LLM Guard (Mandatory)

Define a prompt filter/wrapper that forces:

7.2 Secure Prompt Template

Example:

Write secure code only. Mandatory Requirements:

7.3 Automatic Verification

7.4 Human Review (Irreplaceable)

AI can suggest fixes, but only a developer:


8. Conclusions

  1. AI generates functional code, but not necessarily secure code.
  2. Security depends critically on the prompt. Garbage in, Vulnerability out.
  3. Models replicate insecure patterns common in online tutorials.
  4. A secure pipeline MUST include: SAST + Secure Prompting + Human Review.
  5. Junior devs are at highest risk: They tend to copy output without questioning it.

With discipline, AI can elevate the standard of security instead of degrading it.


9. Future Work

>