Marco Simoni

Researcher in AI, LLMs, Reinforcement Learning & Cybersecurity

Topics: LLM Alignment · RLHF / GRPO / GTPO · Cybersecurity AI · Knowledge Graphs

About Me

I am a researcher passionate about Large Language Models (LLMs) and their alignment through Reinforcement Learning (RL). My work explores how advanced RL techniques such as GRPO and GTPO can improve reasoning stability and trustworthiness in LLMs. While I have applied these methods to domains like cybersecurity, my main focus is on pushing the boundaries of AI alignment and building smarter, more reliable models.

Papers

Articles

GTPO vs GRPO: A Smarter Path to Stable Reasoning LLMs

In this piece I discuss the differences between GRPO and GTPO, two reinforcement learning approaches designed to stabilize and align large language models. Why is alignment crucial for reasoning? Because RL can make LLMs not only more powerful, but also more trustworthy.

REINFORCE vs. Posterior Token Targets: Two Paths to Steering Language Models

Sharing some brief notes I wrote for myself ... maybe useful for others too.

👉 Posterior: update = p - q (deterministic, low variance, compute-heavy).
👉 REINFORCE: update = -A(y)(e_y - p) (lightweight, scalable, noisy; matches q only in expectation).
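
To make the two update rules concrete, here is a toy sketch on a five-token vocabulary. The uniform target q and the advantage choice A(y) = log q(y) - log p(y) are illustrative assumptions, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 5                                        # toy vocabulary size
logits = rng.normal(size=V)
p = np.exp(logits) / np.exp(logits).sum()    # current model distribution p
q = np.ones(V) / V                           # target distribution q (uniform here, as an example)

# Posterior-style update: deterministic, low variance, but needs the full target q.
posterior_update = p - q

def reinforce_update(p, q, rng):
    """REINFORCE-style update: sample y ~ p, weight (e_y - p) by an advantage.

    A(y) = log q(y) - log p(y) is one illustrative choice of advantage;
    the estimate is noisy per sample and only matches the target direction
    in expectation over many samples.
    """
    y = rng.choice(len(p), p=p)
    A = np.log(q[y]) - np.log(p[y])
    e_y = np.zeros_like(p)
    e_y[y] = 1.0
    return -A * (e_y - p)

# Averaging many noisy REINFORCE estimates recovers a deterministic direction.
avg = np.mean([reinforce_update(p, q, rng) for _ in range(20000)], axis=0)
```

Both update vectors sum to zero (they shift probability mass around rather than create it), which is a quick sanity check on either rule.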

Projects

GTPO: Group-relative Trajectory-Based Policy Optimization

GTPO maze optimization visualization

This repository contains the official implementation of GTPO (Group-relative Trajectory-based Policy Optimization), a novel method for stable and effective policy optimization in Large Language Models (LLMs).

GTPO addresses key limitations of Group-relative Policy Optimization (GRPO), namely:

  1. Token-level gradient conflicts – where tokens shared across positively and negatively rewarded completions are inconsistently updated, often penalizing essential formatting tokens.
  2. Policy collapse – where negatively rewarded completions destabilize training, flattening the output distribution and degrading performance.

GTPO introduces conflict-aware gradient corrections and entropy-based regularization to mitigate these issues, ensuring more stable training without the need for KL-divergence regularization or a reference model.
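
As a toy illustration of the first failure mode (this is not the actual GTPO algorithm, just a sketch of the intuition), consider two completions that share formatting tokens but receive opposite group-relative advantages. A naive per-token update cancels or penalizes the shared tokens, while a conflict-aware correction protects them:

```python
from collections import defaultdict

# Two sampled completions with group-relative advantages (+1 correct, -1 wrong).
completions = [
    (["<think>", "2", "+", "2", "=", "4", "</think>"], +1.0),
    (["<think>", "2", "+", "2", "=", "5", "</think>"], -1.0),
]

# Naive per-token update signal: sum of advantages over all occurrences.
naive = defaultdict(float)
for tokens, adv in completions:
    for t in tokens:
        naive[t] += adv

# Tokens appearing in both positively and negatively rewarded completions
# receive conflicting gradients (here they cancel; with unbalanced groups
# they can be actively penalized, even for essential formatting tokens).
conflict_tokens = {
    t for t in naive
    if any(t in toks for toks, a in completions if a > 0)
    and any(t in toks for toks, a in completions if a < 0)
}

# Conflict-aware correction (illustrative): drop the negative contribution
# on tokens that also occur in positively rewarded completions.
corrected = defaultdict(float)
for tokens, adv in completions:
    for t in tokens:
        if adv < 0 and t in conflict_tokens:
            continue
        corrected[t] += adv
```

In this sketch the shared `<think>` token nets to zero under the naive rule but keeps its positive signal under the correction, while the genuinely wrong token `5` is still penalized.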

TITAN: Typed Bidirectional Knowledge Graph for CTI Reasoning

TITAN is a typed, bidirectional knowledge graph framework for Cyber Threat Intelligence (CTI) reasoning and question answering. It integrates data from the MITRE ATT&CK STIX bundles, builds a TITAN Ontology, generates reasoning (CoT) and non-reasoning (NoCoT) datasets, and provides an end-to-end pipeline for model training, evaluation, and graph execution.
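
A minimal sketch of the typed, bidirectional edge idea (class, relation, and entity names below are illustrative, not TITAN's actual API): each forward relation is stored together with an inverse, so the graph can be traversed in both directions during reasoning.

```python
# Hypothetical inverse-relation table in the spirit of ATT&CK relationships.
INVERSE = {"uses": "used-by", "mitigates": "mitigated-by"}

class TypedGraph:
    """Toy typed knowledge graph: (subject, relation) -> set of objects."""

    def __init__(self):
        self.edges = {}

    def add(self, subj, rel, obj):
        # Store the forward edge and, when known, its inverse, so queries
        # work from either endpoint.
        self.edges.setdefault((subj, rel), set()).add(obj)
        inv = INVERSE.get(rel)
        if inv:
            self.edges.setdefault((obj, inv), set()).add(subj)

    def neighbors(self, node, rel):
        return self.edges.get((node, rel), set())

g = TypedGraph()
# Illustrative triples modeled on MITRE ATT&CK-style relationships.
g.add("APT29", "uses", "T1059 Command and Scripting Interpreter")
g.add("M1038 Execution Prevention", "mitigates", "T1059 Command and Scripting Interpreter")
```

With both directions stored, a question like "which groups use T1059?" is a single `used-by` lookup rather than a scan over all forward edges.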

🎬 Demos

TITAN with Chain of Thought (CoT)
No Chain of Thought (Example 1)
No Chain of Thought (Example 2)
TITAN as a tool for a Cybersecurity Agent

Mixture of RAG Security Experts (MoRSE)

I introduce MoRSE (Mixture of RAG Security Experts), the first specialised AI chatbot for cybersecurity, which aims to provide comprehensive knowledge of the field. MoRSE uses two Retrieval Augmented Generation (RAG) systems designed to give clear, structured, and accurate answers to cybersecurity queries. Unlike traditional Large Language Models (LLMs), which rely on parametric knowledge alone, MoRSE retrieves relevant documents from non-parametric knowledge bases in response to user queries and uses them to generate grounded answers, improving accuracy and reliability. In addition, MoRSE's knowledge bases can be updated in real time, enabling continuous knowledge enrichment without retraining.
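
The retrieve-then-answer idea can be sketched as follows; the two knowledge bases, the word-overlap scoring, and the selection rule below are deliberately simplistic stand-ins for MoRSE's actual retrievers, not its implementation:

```python
# Two hypothetical non-parametric knowledge bases, each acting as one "expert".
KB_EXPLOITS = {
    "CVE-2021-44228": "Log4Shell: JNDI lookup in Log4j allows remote code execution",
    "CVE-2017-0144": "EternalBlue: SMBv1 flaw enabling remote code execution",
}
KB_CONCEPTS = {
    "XSS": "Cross-site scripting injects attacker-controlled scripts into pages",
    "RAG": "Retrieval Augmented Generation grounds answers in retrieved documents",
}

def score(query, doc):
    # Naive relevance: count of shared lowercase words.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query):
    # Score every document across both knowledge bases and return the best.
    candidates = [
        (score(query, title + " " + text), title, text)
        for kb in (KB_EXPLOITS, KB_CONCEPTS)
        for title, text in kb.items()
    ]
    _, title, text = max(candidates)
    return title, text

title, context = retrieve("remote code execution in log4j")
```

The retrieved `context` would then be passed to the LLM as grounding for the final answer; updating a knowledge base is just editing the dictionary, with no retraining involved.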