I’ve been experimenting with AI agents and workflows to build efficient agent teams and improve my productivity. After two weeks, I’ve encountered several pitfalls…

Instability Has Decreased but Not Disappeared

Before starting to implement workflows with agents, I expected that the agents themselves would be able to execute instructions stably, making workflows less necessary. However, reality has proven less ideal. Agents have their own instabilities:

  • Unstable control: Workflow descriptions are essentially prompts. While agents can generally follow steps, they tend to be more mechanical and repetitive. Achieving true “control” – such as skipping, looping, or branching decisions – isn’t straightforward, and there’s a probability of outcomes diverging from expectations.
  • Unstable invocation: Skills are fundamentally prompts as well. Although agent code may treat them specially, they’re still well-structured prompts loaded at the beginning. Whether triggering or executing them, there’s no guarantee that the LLM will output exactly as required, leading to suboptimal results.
  • Unstable output: Similar to using LLMs directly, it’s difficult to impose strict formatting constraints on outputs.
  • Multiplicative effect of instability: Just as running a GPU-trained model on a CPU can cause precision loss that completely alters predictions, the instabilities mentioned above exist in every agent. If tasks are chained together by relying on agents, the final outcome may be completely different from what was intended…
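The multiplicative effect in the last point can be made concrete with a toy model. If we treat each step in a chained task as an independent success probability (the 95% figure below is an assumption for illustration, not a measurement), reliability decays geometrically with chain length:

```python
def chain_reliability(per_step_success: float, steps: int) -> float:
    """Probability that every step in a chained task succeeds,
    assuming independent per-step failures."""
    return per_step_success ** steps

# Even a step that "usually works" degrades quickly when chained:
for n in (1, 3, 5, 10):
    print(f"{n:>2} steps at 95% each -> {chain_reliability(0.95, n):.0%} end-to-end")
# 10 steps at 95% each already drops to roughly 60% end-to-end.
```

Real agent steps are not independent, so this understates some failure modes (an early drift can poison every later step), but it shows why per-step instability that feels tolerable in isolation can dominate a long chain.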

In summary, my experience is that agents do significantly stabilize LLM outputs. However, when we grant them permissions and the ability to execute commands directly, the endless extension of task chains and uncontrollable fluctuations can prevent the final results from meeting expectations.
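A common way to compensate for the output instability described above is to validate the model's response against the required format and retry on failure, rather than trusting a single generation. A minimal sketch, with a stubbed model call standing in for any real LLM client:

```python
import json

def unreliable_model(prompt: str, attempt: int) -> str:
    # Stub standing in for an LLM: on the first attempt it wraps the
    # JSON in chatty text, which is exactly the instability we see.
    if attempt == 0:
        return 'Sure! Here is your JSON: {"score": 7}'
    return '{"score": 7}'

def call_with_validation(prompt: str, max_retries: int = 3) -> dict:
    """Retry until the output parses as the JSON object we demanded."""
    for attempt in range(max_retries):
        raw = unreliable_model(prompt, attempt)
        try:
            data = json.loads(raw)
            if isinstance(data, dict) and "score" in data:
                return data
        except json.JSONDecodeError:
            pass  # malformed output: ask again
    raise ValueError("model never produced valid output")

print(call_with_validation("Rate this text 1-10 as JSON"))
```

This doesn't make the model deterministic; it only converts silent format drift into either a valid result or an explicit error, which is usually the best a workflow node can guarantee.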

Agents and Workflows Each Have Their Place

Considering the current issues with agents, workflows – especially those designed to strongly control LLM outputs – likely won’t be replaced anytime soon. After all, some tasks require AI to provide general ideas, but many more have defined paths or SOPs that need strict execution to obtain well‑formed results. Using AI at nodes that can’t be programmed, while generating code with AI for deterministic execution elsewhere, might be a better approach.
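The division of labor suggested above can be sketched as a tiny pipeline: deterministic steps run as plain code, and only the genuinely open-ended step calls a model. All the function names here, including the `call_llm` placeholder, are hypothetical:

```python
def call_llm(prompt: str) -> str:
    # Placeholder: in practice this would call a real model API.
    return "summary of: " + prompt

def load_input(raw: str) -> str:
    # Deterministic node: normalization is ordinary code, so it
    # behaves identically on every run.
    return raw.strip().lower()

def summarize(text: str) -> str:
    # AI node: the only step that cannot be expressed as a fixed rule.
    return call_llm(text)

def format_report(summary: str) -> str:
    # Deterministic node: strict output formatting stays in code,
    # so the final shape of the result never depends on the model.
    return f"REPORT\n======\n{summary}"

def run_workflow(raw: str) -> str:
    return format_report(summarize(load_input(raw)))

print(run_workflow("  Quarterly Numbers  "))
```

The point of this shape is that instability is confined to one node; everything upstream and downstream of `summarize` is testable, repeatable code.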

Figuring out which approach to use for a given task requires time and accumulated experience.
