AI security magic? - Dr. István Zsolt BERTA

AI security magic?

2025-09-11

Schneier's recent blog post shows an attack, where ChatGPT processes a malicious doc and makes the AI agent execute commands within. We have seen such prompt injection attacks before, what puzzles me is that it states that we still don't know how to defend against these. Some comments under the post even say that there is no way to defend against these, as this is exactly what we are using the agentic AI for, and blocking these attacks would remove the features we use it for. Any given attack can be mitigated, can we do something against these on the long run?

Most 'AI security' work I have seen treats AI as a black box, and secures what is around the AI only (and not the AI itself): filters input to remove malicious prompts, sanity checks output, logs what happens, limits privileges of the AI agent to the necessary (least privilege is always a good idea), governs training data and tweaks prompts (asking the AI not to be evil), etc and these all make sense. However, I can accept it is very difficult to defend even against such basic attacks via 'guardrails' only, and without touching what is inside. I am seeing similar claims on AI security, e.g. some recent posts by Disesdi Susanna Cox.

I have spent considerable time picking up the math behind AI/ML (linear algebra, perceptrons, neural networks, etc), but became just more concerned. I often read that AI/ML mimics how the neurons work in the human brain but we have but little clue on why this mathematical construct works, why and how it 'thinks'. (Sure, we understand how certain hyperparameters impact the way the AI learns or 'thinks'.) What made me concerned was that I met this 'we don't know how it works, it just works' statement in AI math training material. We are experimenting with the unknown, pushing random buttons on a machine we know nothing about; in some cases we get a candy, in others we get electrocuted or our left eyebrow turns purple and starts blinking. We observe, and try to figure out what is going on.

🤖🔒🧙‍♂️ Most AI security work I have seen resembles more those old Dungeons&Dragons sessions and not what I know about how science or technology should work. "Any sufficiently advanced [magic] is indistinguishable form [fancy technologies we use]...?"

Mankind advances by experimentation, there is inherently nothing wrong with it. However, if we must use it with access to production data (because sometimes we have to), we have to be conscious that it is a black box we have no reason to trust, and all we can do is throw a bunch of guardrails/wrappers around it and 🤞hope the result will be good.

I don't count myself as very knowledgeable in this area, the above just reflects what I understood. -- What do you think?

Do you know of any works on securing the AI itself and not just via guardrails? (If yes, don't spare me, and I am not afraid of reading math.)

This post was first published on Linkedin here on 2025-08-07.