Archon: Using GPT-5 to Control Your Computer with Natural Language

The Engineer

18 Aug 2025

Archon lets you control your computer with simple natural language instructions, using GPT-5 to understand complex tasks and plan the steps needed to carry them out.

Over the weekend, I took home third place at OpenAI's GPT-5 Hackathon with a project called Archon. Archon acts as a copilot for your computer: you tell it what to do in natural language, and it does it. The hack combines GPT-5’s reasoning capabilities with a mini vision model to execute tasks.

How Archon Works

Archon is designed to sit at the bottom of your Mac or Windows screen, where you can input what you want your computer to do in plain English. Here’s a breakdown of its architecture:

User Intent: You provide natural language commands.
Planner (GPT-5): GPT-5 processes these commands and plans the necessary actions.
Fast Click Grounding (prava-fc-small): A custom fine-tuned model executes clicks and keystrokes based on the plan.
Executor: The final step where the actual interactions with your computer occur.
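
The four stages above can be sketched as a small pipeline. This is a minimal illustration, not Archon's actual code: the `Step` and `Action` types, the `run_command` function, and the stub planner and grounder are all hypothetical, and the split where the grounding stage produces an action and the executor performs it is an assumption about how the stages divide the work.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One planned step in plain language, e.g. 'click the Start button'."""
    description: str


@dataclass
class Action:
    """A concrete input event: a click at (x, y) or a key press."""
    kind: str  # "click" or "key"
    x: int = 0
    y: int = 0
    key: str = ""


# Hypothetical interfaces for the three stages that follow user intent.
Planner = Callable[[str, bytes], List[Step]]
Grounder = Callable[[Step, bytes], Action]
Executor = Callable[[Action], None]


def run_command(intent: str, screenshot: bytes, plan: Planner,
                ground: Grounder, execute: Executor) -> List[Action]:
    """Plan the intent, ground each step to an action, then execute it."""
    performed: List[Action] = []
    for step in plan(intent, screenshot):
        action = ground(step, screenshot)
        execute(action)
        performed.append(action)
    return performed


# Stand-ins for GPT-5 and prava-fc-small so the sketch runs offline.
def fake_planner(intent: str, screenshot: bytes) -> List[Step]:
    return [Step("click the Start button"), Step("press W to accelerate")]


def fake_grounder(step: Step, screenshot: bytes) -> Action:
    if step.description.startswith("click"):
        return Action(kind="click", x=100, y=200)  # made-up coordinates
    return Action(kind="key", key="w")


# Here the "executor" just prints each action instead of moving the mouse.
actions = run_command("start playing", b"", fake_planner, fake_grounder, print)
```

Passing the stages in as plain callables keeps the sketch testable: a real planner or grounding model could be swapped in without touching `run_command`.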

Key Components

Vision Model

Archon captures screenshots of your screen for a mini vision model to read. This is crucial for understanding the current state of the interface, especially in dynamic applications like games or web browsers. Capturing a screenshot is quick, taking only about 10 milliseconds.

Why it matters: Real-time visual feedback lets Archon adapt to changes in the UI and perform actions accurately.
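
One way to picture this feedback is an observe-act loop that takes a fresh screenshot before every action. The sketch below is an assumption about the general pattern, not Archon's implementation: `observe_act_loop`, `capture`, `decide`, and `act` are hypothetical names, and the "screens" are simulated byte strings rather than real captures.

```python
from typing import Callable, Optional


def observe_act_loop(capture: Callable[[], bytes],
                     decide: Callable[[bytes], Optional[str]],
                     act: Callable[[str], None],
                     max_steps: int = 10) -> int:
    """Re-capture the screen before every action; return actions taken."""
    taken = 0
    for _ in range(max_steps):
        frame = capture()        # fresh screenshot each iteration
        action = decide(frame)   # None means the task is finished
        if action is None:
            break
        act(action)
        taken += 1
    return taken


# Simulated screens standing in for real screenshots.
frames = iter([b"menu", b"race", b"finish"])
rules = {b"menu": "click Start", b"race": "hold W"}
log = []
count = observe_act_loop(lambda: next(frames), rules.get, log.append)
print(count, log)
```

Because the decision is made from the latest frame rather than from a plan fixed in advance, a change in the UI between steps changes the next action.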

GPT-5 Reasoning

GPT-5's reasoning capabilities are the backbone of Archon. Here’s how we used different aspects of GPT-5:

High Thinking Mode: This mode allows GPT-5 to break down complex, multi-step processes into discrete, executable steps while maintaining context across long interactions.
- Example: In a racing game demo, a single command "start playing" was broken down into recognizing the view, using WASD controls, and navigating the track.
Vision Mode: GPT-5 can perceive the screen and understand visual elements, which is essential for tasks that require visual input.
Function Calling Preambles: These enable Archon to show users what it’s thinking while simultaneously calling the grounding model to execute actions.
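
The preamble idea can be sketched as follows: each planner turn carries a short user-facing note plus a tool call, and the note is shown before the call is dispatched. The `PlannerTurn` and `ToolCall` shapes and the `handle_turn` function are invented for illustration and are not the OpenAI API response format.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ToolCall:
    """Hypothetical tool call: a tool name plus keyword arguments."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PlannerTurn:
    """Hypothetical planner output: a preamble and the call it explains."""
    preamble: str
    tool_call: ToolCall


def handle_turn(turn: PlannerTurn,
                show: Callable[[str], None],
                tools: Dict[str, Callable[..., Any]]) -> Any:
    """Show the preamble to the user, then dispatch the tool call."""
    show(turn.preamble)
    handler = tools[turn.tool_call.name]
    return handler(**turn.tool_call.arguments)


# A stub grounding tool; a real one would return screen coordinates.
tools = {"click": lambda target: f"clicked {target}"}
turn = PlannerTurn("Clicking the Play button to start the race.",
                   ToolCall("click", {"target": "Play"}))
result = handle_turn(turn, print, tools)
```

Showing the preamble first means the user sees the intent even if the tool call itself takes a moment to complete.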

Compute Trade-offs

We matched the amount of compute to the complexity of the task:

High Reasoning Effort: For complex workflows, GPT-5 maps out interaction sequences with error handling.
Low Latency Mode: Using GPT-5-mini with function calling preambles allows for faster execution while still maintaining user awareness.
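
A routing rule for this trade-off might look like the sketch below. The `ModelConfig` type, the `choose_config` function, the step-count threshold, and the pairing of GPT-5-mini with low reasoning effort are all assumptions made for illustration; the post does not describe how Archon decides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical pairing of a model name with a reasoning effort."""
    model: str
    reasoning_effort: str


def choose_config(estimated_steps: int, latency_sensitive: bool) -> ModelConfig:
    """Pick the cheaper, faster setup unless the task looks complex.

    The threshold of two steps is an arbitrary placeholder.
    """
    if latency_sensitive or estimated_steps <= 2:
        return ModelConfig(model="gpt-5-mini", reasoning_effort="low")
    return ModelConfig(model="gpt-5", reasoning_effort="high")


print(choose_config(estimated_steps=8, latency_sensitive=False))
print(choose_config(estimated_steps=8, latency_sensitive=True))
```

In this sketch latency sensitivity wins over complexity, which suits something like a live game; a different application might reasonably reverse that priority.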

Demo and Performance

In a racing game demo, Archon followed instructions accurately. Latency kept it from winning the race, but its instruction following was better than that of previous models.

What it means: This shows that GPT-5’s reasoning can handle complex tasks in real-world scenarios, even if there are some performance bottlenecks to address.

Future Goals

The ultimate goal of Archon is to make computers self-driving. By combining GPT-5's powerful reasoning with tiny fine-tuned models, we aim to control any interface through natural language commands. This could revolutionize how users interact with their devices, making complex tasks simpler and more intuitive.

Conclusion

Archon is a promising step towards a future where your computer can understand and execute your commands as if it were an assistant. The combination of GPT-5’s advanced reasoning and real-time visual feedback makes this possible. While there are still challenges to overcome, the potential applications are vast.