There's a lot of buzz lately around "Computer-Use" - AI agents that can supposedly click buttons, open dashboards, and magically "use" your computer. Under the hood, though, most of these solutions are just wrapping Playwright automation with a marketing label. Many developers may think they need MCP in order to run browser agents, but that's not the case.
But here's the thing: the real power isn't in automating clicks. It's in connecting autonomous agents that can code, browse, and collaborate - seamlessly.
That's exactly what the Inference Gateway CLI and its A2A (Agent‑to‑Agent) system enable.
🧐 From Computer‑Use to Real Autonomy
Instead of running a monolithic model that tries to "use a computer" developers can now work with specialized A2A Agents:
- 🤓 Coder Agent - generates and tests code
- 🌐 Browser Agent - runs browser‑based automation tasks
- 📚 Documentation Agent - resolves library names and fetches current documentation (Context7-style capabilities)
Each agent has its own lifecycle and responsibility, communicating through a shared A2A protocol - not hidden APIs.
🤔 Learning from Current Browser Agents
I recently tried OpenAI's newly launched Atlas browser agent. It's impressive technology, but it also highlighted some fundamental challenges with monolithic browser automation.
Each operation requires significant processing time - analyzing the page, determining the next action, executing it, and repeating. While this demonstrates the capability of AI to navigate interfaces, in practice you often find yourself thinking: "I could accomplish this task much faster manually".
This isn't a criticism of Atlas specifically - it's an early-stage product showing real promise. Rather, it illustrates the inherent limitations of monolithic "Computer-Use" approaches where a single model handles everything sequentially.
👁️ Visual Debugging with VNC
One of the most compelling features for observing browser automation is VNC access. Unlike OpenAI's Atlas where you observe the agent working through their interface, the Inference Gateway browser agent container exposes a VNC endpoint that lets you watch the browser automation happening live - just like sitting at the computer yourself.
When you run the A2A example with docker compose up -d, the browser agent container automatically starts with VNC enabled. You can inspect what the agent is doing by connecting to it with any VNC client (viewer):
# Default VNC port is typically 5900
# Use any VNC client like TigerVNC, RealVNC, or TightVNC
$ vnc://localhost:5900
This gives you a real-time visual window into exactly what the Inference Gateway browser agent is doing - you can watch it navigate pages, click buttons, fill forms, and capture screenshots. It's invaluable for:
- Debugging automation scripts - See exactly where and why a selector fails
- Demonstrating capabilities - Show stakeholders the agent working in real-time
- Understanding agent behavior - Watch how the agent interprets and interacts with web pages
- Building trust - Visual confirmation builds confidence in automation reliability
The VNC connection allows both passive observation and active control - you can watch the agent work autonomously, or take control when needed to debug interactively, test manual interactions, or intervene in the automation flow. This flexibility makes VNC invaluable for development and troubleshooting while maintaining the autonomous nature of A2A in production workflows.
The A2A Alternative
The A2A approach flips this entirely. Instead of one slow model trying to do everything, you have fast, specialized agents that know exactly what they're doing - and they can run in parallel, communicate instantly, and give you back control when needed.
Important architectural note: To some this might be not obvious, the LLMs themselves don't live inside the agent containers, therefore you don't need to worry about resource constraints and GPUs. The agents are lightweight containers that consume inference from the gateway and leverage TTC (Test Time Compute) capabilities. This separation means you can swap models, scale inference independently, and keep agent containers focused purely on their specialized tasks - not on running heavyweight models.
Stateless by Design
Another key design decision: the A2A protocol is almost stateless. When you submit a task and query for results, there's no persisted communication channel between the A2A client and server.
This stateless architecture makes it trivial to scale - you can spin up multiple agent instances, load balance requests, and handle failures without managing complex session state or long-lived connections, when a task gets submitted it goes into an external queue.
And because tasks are self-contained units, retry logic is built-in and straightforward. If a task fails, you can simply resubmit it without worrying about partial state, corrupted sessions, or cleanup. The agent picks up the task fresh, executes it, and returns results.
As browser automation technology matures, we'll likely see different approaches converge on similar architectural patterns - specialization, modularity, and clear separation of concerns.
Authentication & Security Constraints
One significant constraint for browser agents today is authentication. Most browsers are limited by security policies that require user credentials to continue tasks. Users often need to manually input passwords or handle 2FA challenges, which breaks the automation flow.
As of this writing, emerging solutions are addressing this - notably Browserbase's integration with 1Password's Agentic Autofill, which securely delivers credentials to AI agents at runtime without storing secrets, using human-in-the-loop authorization. This represents the kind of infrastructure evolution needed to make browser agents truly autonomous while maintaining security best practices.
⚙️ Enabling A2A Mode
The A2A workflow in the Inference Gateway CLI is enabled by a single switch.
Inside the examples/a2a directory, you'll find a complete setup using Docker Compose:
# Clone and enter the example
$ git clone https://github.com/inference-gateway/cli.git
$ cd cli/examples/a2a
# Copy environment template
$ cp .env.example .env
Set the following in your .env file:
$ INFER_GATEWAY_URL=http://inference-gateway:8080
$ INFER_A2A_ENABLED=true
$ INFER_TOOLS_ENABLED=false
$ INFER_AGENT_MODEL=deepseek/deepseek-chat
Then bring everything up:
$ docker compose up -d
This spins up:
- 🧜♀️ the Inference Gateway
- 💬 a CLI container
- 🤖 example A2A agents
💬 Running the CLI
Once running, open an interactive session:
$ docker compose run --rm cli
Inside the container, you can start an autonomous task:
$ infer agent "Open the project dashboard in the browser agent and summarize open pull requests"
Here's what happens behind the scenes:
- The Inference Gateway CLI acts as an A2A client.
- The Coder Agent (your CLI session) identifies that browser automation is required.
- It uses
A2A_SubmitTaskto send the task to the Browser Agent. - The Browser Agent performs the automation - e.g., fetching, clicking, inspecting elements.
- Results are monitored automatically through polling - the agent gets notified when tasks complete without requiring webhooks or exposing additional endpoints.
- Artifacts (screenshots, JSON logs, etc.) are retrieved using
A2A_DownloadArtifacts- stored in MinIO or on the filesystem.
All communication happens over open A2A protocol events, not vendor-locked APIs.
The beauty of true autonomy:
You don't even need an IDE running. The agents are truly agentic - you submit a task that might take several minutes, the agent works independently, and then returns with the final artifacts and results. Those results can be augmented back into the main context, allowing you to continue your work with the completed output. This is fundamentally different from tools that require constant supervision or an active development environment.
Real-world example:
Imagine a frontend developer working on a website. Instead of the traditional cycle of writing code → checking the IDE → opening the browser → verifying results → repeat, they simply work through a chat interface in the CLI. They describe what they want: "Update the navigation bar to include a dark mode toggle and ensure it's responsive on mobile". The Coder Agent writes the code, then submits a task to the Browser Agent to verify the output looks good across different screen sizes.
Here's where it gets interesting: the Browser Agent captures Playwright screenshots and sends them in base64 format to a vision-capable LLM. The LLM analyzes the visual output autonomously - checking alignment, responsiveness, color contrast, layout issues - without human intervention. If something doesn't match the requirements, the agents iterate until it's right - automatically.
The developer never switches context, never opens an IDE, never manually refreshes the browser. Multiple specialized agents, each capable of achieving their particular task, work together to fulfill the requirements completely.
🧪 Debugging & Observability
The example even includes a debugger container for monitoring A2A tasks:
$ docker compose run --rm a2a-debugger tasks list
$ docker compose run --rm a2a-debugger tasks get <task_id>
$ docker compose run --rm a2a-debugger tasks submit-streaming "List my calendar events"
This makes inter‑agent communication fully transparent - you can watch agents talk, retry, and complete tasks in real time.
Built-in observability: Both the Inference Gateway and agents built using the ADL CLI support OpenTelemetry out of the box. This means you get distributed tracing, metrics, and logging across your entire agent ecosystem without additional configuration - making it easy to integrate with your existing observability stack (Jaeger, Grafana, Datadog, etc.).
This is one of the reasons I don't believe in the future of sub-agents as markdown files. While markdown-based approaches might seem simple at first, they're overly simplified and lack the customization needed for real-world use. We should leverage our existing software stack - compilers aren't going to change, writing Dockerfiles isn't going anywhere, and proper infrastructure tooling (CI/CD, observability, security) is battle-tested. A2A agents integrate with these proven systems rather than trying to reinvent them with text files.
🌐 AI‑Friendly App Design
As more applications are built with AI assistance in mind, they're becoming AI‑friendly by design: interfaces are simpler, APIs are more robust, and common tasks are easily scripted. This shift will make it even easier for browser agents to automate workflows - not by hacking around UIs, but by interacting with well‑structured web apps designed for humans and agents alike.
I believe the future of web apps will be built for both humans and agents. That means:
- Clear, clickable buttons instead of complex drag-and-drop interfaces (which are notoriously difficult for browser agents to handle)
- Predictable navigation patterns that don't rely on hover states or hidden menus
- Semantic HTML that makes it easy for agents to understand page structure
- Stable selectors that don't break with every UI update
In the future, launching a CI pipeline, reviewing dashboards, or collecting analytics will be as straightforward for an agent as it is for a human developer - because the app will have been built to support both.
🛠️ Built-in Tools
The A2A protocol can integrate with MCP (Model Context Protocol) when needed, but for most workflows, it's not necessary. The A2A CLI comes with built-in tools for common tasks like Read, Write, Edit, Web_Search, Grep, Web_Fetch, and Github - similar to how Claude Code works, but with a crucial advantage: you can use any provider (DeepSeek, OpenAI, Anthropic, or others).
API Compatibility Challenge
One early technical challenge was handling different LLM provider message formats. The Inference Gateway needed to transform between Google's Gemini API format and OpenAI's API format to support multiple providers seamlessly. This was the first hurdle to solve for true provider flexibility.
Today, A2A is fully compatible with the standard OpenAI API specification, which most providers now support. This means the transformation layer handles the complexity behind the scenes, allowing you to switch between providers without changing your agent code.
Direct Tool Calling
The Inference Gateway CLI lets you call the same tools the LLM uses - directly, without going through the model. Simply prefix your command with !! and an autocomplete dropdown shows all available tools. This enables:
- Troubleshooting - Test and debug tool functions before letting the LLM use them
- Token savings - Mix deterministic outputs (direct tool calls) with non-deterministic LLM reasoning
- Hybrid workflows - Combine predictable operations with AI decision-making where it matters most
For example, !!Read(content="./README.md") executes the Read tool directly, saving tokens when you know exactly what you need. Or if you just want to run the bash tool !ls -la lists files without involving the LLM at all.
For security, bash commands can be whitelisted using standard regex patterns, giving you fine-grained control over which commands are allowed - perfect for production environments where you need to restrict potentially dangerous operations.
Dynamic Prompt Injections
The CLI supports dynamic prompt injections - a powerful feature that lets you inject reminders or instructions into the conversation at regular intervals. Configure the system to inject a prompt every 4th or 5th message (fully configurable) to keep the agent focused on specific behaviors or constraints.
For example, you might inject "Remember to prioritize security and validate all inputs" every 5 turns, or "Keep responses concise and code-focused" every 4 turns. This ensures long-running agent sessions stay aligned with your requirements without manual intervention, especially useful for maintaining consistent behavior in extended workflows.
A2A Communication Tools
Beyond standard tools, A2A includes specialized communication tools that enable true autonomous agent collaboration:
A2A_SubmitTask- Delegate work to specialized agentsA2A_QueryTask- Check task status and resultsA2A_QueryAgent- Discover agent capabilitiesA2A_DownloadArtifacts- Retrieve generated files and outputs
These tools allow agents to coordinate complex workflows autonomously - submitting tasks, monitoring progress, and collecting results without human intervention.
Skills vs Tools
An important distinction: skills are what agents expose to clients as high-level capabilities, while tools are the atomic operations available to the agent. A single skill might orchestrate multiple tool calls behind the scenes. For example, a "code review" skill might internally use Read to fetch files, Grep to search for patterns, Web_Search to look up best practices, and Write to generate the review report - all coordinated automatically. This abstraction allows agents to offer powerful, domain-specific capabilities while hiding implementation complexity from the client.
🔍 Running A2A Without an LLM
One of the most powerful yet underappreciated aspects of the A2A protocol is that you don't always need an LLM at all. While AI-powered reasoning is valuable for complex decision-making, many automation tasks can be solved more efficiently with deterministic approaches.
Pattern Matching & Programmatic Logic
The A2A agents are just containers running code - they can use traditional programming techniques instead of LLM inference:
- Regex pattern matching - Extract structured data from web pages, logs, or documents
- XPath/CSS selectors - Navigate DOM elements precisely without vision models
- Conditional logic - Make decisions based on explicit rules and business logic
- Data transformation - Parse, filter, and reshape data using traditional programming
For example, a browser agent could scrape a dashboard, extract metrics using CSS selectors, compare values against thresholds, and trigger alerts - all without ever calling an LLM. The A2A protocol doesn't care how the agent achieves its goal; it just defines how agents communicate.
Semantic Search Without LLMs
Another compelling approach is semantic search using embedding models - which are much faster and cheaper than full LLM inference:
- Vector embeddings - Use models like
sentence-transformersor OpenAI's embedding API to convert text to vectors - Similarity matching - Find relevant content by comparing embedding distances (cosine similarity)
- Retrieval augmentation - Search documentation, code, or knowledge bases semantically
- Classification - Categorize content based on semantic similarity to known examples
This hybrid approach lets you combine the speed and cost-effectiveness of embeddings with the reasoning power of LLMs only when needed. For instance, an agent could use semantic search to find relevant documentation chunks, then only invoke an LLM to synthesize a final answer.
When to Skip the LLM
Consider using deterministic logic or embeddings instead of LLM inference when:
- The task is well-defined - Rules are explicit and don't require interpretation
- Speed matters - Pattern matching is orders of magnitude faster than LLM calls
- Cost is a concern - Avoid token costs when traditional code works fine
- Consistency is critical - Deterministic outputs are predictable and reproducible
The A2A architecture supports this flexibility naturally. You can build agents that mix approaches - using embeddings for retrieval, regex for extraction, and LLMs only for the final reasoning step where natural language understanding truly adds value.
🏠 Self-Hosted Freedom
The entire stack - Inference Gateway, CLI, and all A2A agents - can run on your own infrastructure. The Inference Gateway is cloud-native and designed for Kubernetes deployment, making it a good fit for enterprise environments. This gives you complete control over your data, models, and workflows. No vendor lock-in, no data leaving your premises, just pure automation under your control.
🧐 Takeaway
Developers don't need magic. They need automation they can understand, extend, and trust.
With Inference Gateway's CLI + A2A Agents, you get:
- Open standards - A team of specialized agents cooperating over the A2A protocol, not vendor-locked APIs
- Built-in productivity - Tools for common tasks and agent-to-agent communication, ready out of the box
- Provider flexibility - Use any LLM provider (DeepSeek, OpenAI, Anthropic, etc.) without vendor lock-in
- Self-hosted control - Run everything on your own infrastructure with complete control over data and models
- Easy scaffolding - Build new specialized agents as your business needs grow using the ADL CLI
- Architecture flexibility - A2A is an architecture for agent collaboration, not an LLM requirement. Use the right tool for each subtask - whether that's a regex, an embedding model, or a full LLM - and orchestrate them together through the A2A protocol
That's not "Computer‑Use" with a heavyweight model.
That's a fleet of specialized agents communicating with each other to achieve your goals.
A final thought: Many people focus solely on which LLM is "good" or "bad" - and while model quality matters, the tooling around the LLM is what truly enables it to work effectively. A powerful model with poor tooling will underperform, while even a smaller, faster model with the right tools and architecture can deliver exceptional results. The A2A approach proves this: it's not just about the model - it's about the ecosystem that empowers it.
Try it yourself:
👉 inference gateway/cli/examples/a2a
