Watching an agent work

I pointed at a button in a browser the agent was driving, said “this one,” and on its next turn it clicked exactly that button. No description, no selector, no “the blue one near the top of the second card.” I put my cursor on the thing and it understood the thing. That small moment is the one that reorganized how I think about working with these systems, more than any benchmark has.

Most of the conversation about agents is about the acting: can it use the tools, can it finish the task, can you leave it alone. The acting was never the hard part to imagine. What I had not appreciated until I built a place to do it is the value of co-presence: a real browser, live on the screen, that the agent is driving and I am watching at the same time, where I can see precisely what it sees and reach in. It is the difference between reading someone’s report after the work is done and sitting next to them while they do it. The report tells you what happened. Sitting beside them lets you say “wait, not that one” before it matters.

Pointing is the channel that makes co-presence more than spectating. When the agent and I are looking at the same rendered page, the most natural thing a human can do is gesture, and gesture turns out to carry information that words fumble. Describing a UI element to a machine is its own little translation task, error-prone in both directions, and it disappears entirely if you can just put your cursor on the pixel and let the system resolve what sits underneath. The point I make is stashed, and on the agent’s next turn it arrives as part of the instruction: the operator is looking at this coordinate, act there. I stopped narrating the interface to the agent and started doing what I would do with a colleague at my shoulder, which is jab a finger at the screen.

One browser, two drivers. The point you make waits, then enters the agent's next turn as a coordinate.
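
The stash-and-deliver step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system described here: `Point`, `PointStash`, and `build_turn_instruction` are hypothetical names, and the rule that a newer point replaces an older one is my guess at reasonable behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Point:
    """A viewport coordinate the operator pointed at (hypothetical shape)."""
    x: int
    y: int


class PointStash:
    """Holds the operator's most recent point until the agent's next turn."""

    def __init__(self) -> None:
        self._pending: Optional[Point] = None

    def point_at(self, x: int, y: int) -> None:
        # Assumption: only the latest gesture matters, so a newer point
        # overwrites an older one that was never delivered.
        self._pending = Point(x, y)

    def take(self) -> Optional[Point]:
        # Hand the point over once, then clear it so it does not leak
        # into later turns.
        point, self._pending = self._pending, None
        return point


def build_turn_instruction(task: str, stash: PointStash) -> str:
    """Compose the agent's next-turn text, attaching a pending point if any."""
    point = stash.take()
    if point is None:
        return task
    return (
        f"{task}\n"
        f"The operator is looking at coordinate ({point.x}, {point.y}); act there."
    )
```

The point is consumed when it is delivered, so a gesture shapes exactly one turn and a turn with no gesture carries only the task text.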

Once the browser is a shared surface rather than a private one the agent owns, the rest follows without being designed. More than one person can watch the same session at once, from a laptop and a phone, because watching is just subscribing and there is no reason a stream has only one reader. Either watcher can take the controls. Taking the controls locks the agent out for the moment, so two hands are never fighting over the same input, and handing them back lets it resume. The model I kept reaching for is not a service you call. It is a participant in a room you are also in, and the room has a door you can both use, one at a time.
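
The room-with-one-door arrangement can be sketched as a session object with many watchers and one driver at a time. Every name here (`SharedSession`, `take_controls`, `hand_back`, `may_act`, the `AGENT` label) is hypothetical, and frames are plain strings to keep the example small; treat it as one way the idea could be arranged, not a description of the actual implementation.

```python
from typing import Callable, List, Optional

AGENT = "agent"  # hypothetical label for the agent as a driver


class SharedSession:
    """One browser session: any number of watchers, one driver at a time."""

    def __init__(self) -> None:
        self._watchers: List[Callable[[str], None]] = []
        self._human_driver: Optional[str] = None

    def subscribe(self, on_frame: Callable[[str], None]) -> None:
        # Watching is just subscribing; nothing limits the stream to one reader.
        self._watchers.append(on_frame)

    def publish(self, frame: str) -> None:
        # Every watcher receives the same frame.
        for on_frame in self._watchers:
            on_frame(frame)

    def take_controls(self, who: str) -> bool:
        # Refuse if a different person already holds the controls.
        if self._human_driver is not None and self._human_driver != who:
            return False
        self._human_driver = who
        return True

    def hand_back(self, who: str) -> None:
        # Only the current holder can release; the agent then resumes.
        if self._human_driver == who:
            self._human_driver = None

    def may_act(self, who: str) -> bool:
        # Assumption: the agent drives whenever no person holds the controls.
        if self._human_driver is None:
            return who == AGENT
        return who == self._human_driver
```

In this sketch the agent would check `may_act(AGENT)` before each input it sends, which is what keeps two hands from fighting over the same page.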

The honesty I owe this is that the whole thing rests on the agent working somewhere I can see. Pointing works because the browser is a real surface with real coordinates, rendered in front of both of us. The instant the agent does something off that surface (edits a file I do not have open, calls an interface that returns no picture, takes a step that produces a line in a log and nothing on the screen), the co-presence is gone and I am back to trusting a record after the fact. So the place I have made it genuinely good to supervise an agent is exactly the place where its work happens to be visual, and most of what these systems do is not. Pointing a finger does not help when the work has no screen to point at.