Is Attention All You Need?
Named after the landmark
2017 paper “Attention Is
All You Need” by Vaswani
et al. — the paper that
introduced Transformers
and changed everything.
In 2017, eight Google researchers published a paper that changed everything. “Attention Is All You Need” gave us the Transformer — the architecture behind every LLM you have talked to this year. For the neural network itself, attention really is enough.
For you, the developer building with these models? That is a very different question.
A raw LLM is a text-in, text-out function. No memory, no tools, no way to act on the world. To build something useful — something that can search, compute, read files, and loop until it finds an answer — you need scaffolding. That scaffolding is what we call an agent.
This post walks through exactly that construction, from a single HTTP call to a proper autonomous loop, using Ollama and C++ (the king of all languages). Same patterns apply in any language.
Setting Up
Local vs Cloud: local
models are free, private,
and offline. Cloud models
(Claude, GPT) are smarter
but cost money and send
your data to a server.
For learning, go local.
First, install Ollama and pull a model:
brew install ollama
ollama pull deepseek-r1:8b
# starts REST API on localhost:11434
For C++ we need two libraries: cpr for HTTP and nlohmann/json for JSON.
Both can be installed with vcpkg or fetched with CMake's FetchContent; once available, the CMake wiring is:
# CMakeLists.txt excerpt
find_package(cpr REQUIRED)
find_package(nlohmann_json REQUIRED)
target_link_libraries(agent PRIVATE cpr::cpr nlohmann_json::nlohmann_json)
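If you would rather not depend on a package manager, a FetchContent variant looks roughly like this (the git tags are illustrative; pin whichever released versions you have verified):

```cmake
# CMakeLists.txt — FetchContent variant
include(FetchContent)
FetchContent_Declare(cpr
  GIT_REPOSITORY https://github.com/libcpr/cpr.git
  GIT_TAG 1.10.5)
FetchContent_Declare(json
  GIT_REPOSITORY https://github.com/nlohmann/json.git
  GIT_TAG v3.11.3)
FetchContent_MakeAvailable(cpr json)

add_executable(agent main.cpp)
target_link_libraries(agent PRIVATE cpr::cpr nlohmann_json::nlohmann_json)
```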
What Is an Agent?
At its core, an agent is: a loop where an LLM decides what to do next.
The model decides what to do. Your code executes it and feeds the result back. Simple contract. Profound consequences.
The Five Stages
Here is each stage in turn, with the code to build it:
Stage 1: Raw Call

Send a prompt, get a response, done. No loop, no memory, no tools. One prompt → one reply. Perfect for self-contained, one-shot tasks where the model needs no external data to answer correctly.
#include <cpr/cpr.h>
#include <nlohmann/json.hpp>
#include <iostream>

using json = nlohmann::json;

std::string ask(const std::string& prompt) {
    json body;
    body["model"] = "deepseek-r1:8b";
    body["prompt"] = prompt;
    body["stream"] = false;

    auto r = cpr::Post(
        cpr::Url{"http://localhost:11434/api/generate"},
        cpr::Header{{"Content-Type", "application/json"}},
        cpr::Body{body.dump()});
    return json::parse(r.text)["response"];
}

int main() {
    std::cout << ask("What does inakamono mean?") << "\n";
    return 0;
}
Good for:

- Summarization
- One-shot code generation
- Translation and reformatting
- Single factual questions

What it cannot do:

- Remember earlier messages
- Look up real-time information
- Take multi-step actions
- Recover from wrong guesses
Stage 2: Loop

Add a while loop and a message array. Each turn you append the user message,
call the model, and append its reply. The model appears to "remember" earlier turns —
but really you are just sending its own past responses back to it on every request.
This is how every ChatGPT-style interface works.
int main() {
    json messages = json::array();  // growing conversation history

    while (true) {
        std::cout << "You: ";
        std::string input;
        std::getline(std::cin, input);
        if (input == "/quit") break;

        // 1. append user message
        json user_msg;
        user_msg["role"] = "user";
        user_msg["content"] = input;
        messages.push_back(user_msg);

        // 2. call /api/chat (not /api/generate)
        json body;
        body["model"] = "deepseek-r1:8b";
        body["messages"] = messages;
        body["stream"] = false;

        auto r = cpr::Post(
            cpr::Url{"http://localhost:11434/api/chat"},
            cpr::Header{{"Content-Type", "application/json"}},
            cpr::Body{body.dump()});
        std::string reply = json::parse(r.text)["message"]["content"];

        // 3. append assistant reply, loop
        json asst;
        asst["role"] = "assistant";
        asst["content"] = reply;
        messages.push_back(asst);
        std::cout << "Bot: " << reply << "\n";
    }
    return 0;
}
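One thing this loop glosses over: messages grows without bound, and every request resends the entire history, so a long chat eventually overflows the model's context window. A common fix is to resend only the system prompt plus the most recent turns. A minimal sketch, using a role/content pair as a stand-in for the json message objects:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Msg = std::pair<std::string, std::string>;  // {role, content}

// Keep the first message (the system prompt) plus the last `keep` turns.
// In the real loop you would trim the json messages array the same way
// before each request.
std::vector<Msg> trim_history(const std::vector<Msg>& msgs, std::size_t keep) {
    if (msgs.size() <= keep + 1) return msgs;   // nothing to trim yet
    std::vector<Msg> out;
    out.push_back(msgs.front());                // preserve the system prompt
    out.insert(out.end(), msgs.end() - keep, msgs.end());
    return out;
}
```

Smarter schemes exist (summarising dropped turns, for instance), but keep-the-tail is the one to start with.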
Stage 3: Labels

Before OpenAI shipped function calling, this was the trick everyone used.
You teach the model — in its system prompt — to output special labelled tags
when it wants to do something: [SEARCH: rust lifetimes],
[CALC: 42 * 1.5].
Your code scans every response for these labels, executes the matching action,
and injects the result back as a new message. The loop continues until the model
outputs [FINAL: ...].
It is fragile if the model drifts from the format, but it works on any model — even ones with no native tool-calling support.
You are a helpful assistant with access to tools.
To use a tool, output its label on its own line:

[SEARCH: <query>]       -- search the web for information
[CALC: <expression>]    -- evaluate a math expression
[FILE: <path>]          -- read a file from disk
[RUN: <shell command>]  -- execute a shell command

After each tool call you will receive its output as:
[RESULT: <output>]

When you have a complete answer, output:
[FINAL: <your answer here>]

Never guess. Use tools for anything uncertain.
#include <regex>
#include <functional>
#include <unordered_map>

using ToolFn = std::function<std::string(const std::string&)>;

int main() {
    // 1. register tools
    std::unordered_map<std::string, ToolFn> tools;
    tools["SEARCH"] = [](const std::string& q)    { return web_search(q); };
    tools["CALC"]   = [](const std::string& expr) { return evaluate(expr); };
    tools["FILE"]   = [](const std::string& path) { return read_file(path); };
    tools["RUN"]    = [](const std::string& cmd)  { return shell_exec(cmd); };

    // 2. label regex: matches [NAME: args]
    std::regex label_re(R"(\[([A-Z_]+):\s*([^\]]+)\])");

    json messages = build_system_messages();
    messages.push_back(user_message(initial_question));

    for (int step = 0; step < 15; ++step) {
        std::string reply = chat(messages);
        messages.push_back(assistant_msg(reply));

        // 3. scan response for labels
        std::sregex_iterator it(reply.begin(), reply.end(), label_re);
        std::sregex_iterator end_it;
        if (it == end_it) {
            std::cout << reply << "\n";
            break;
        }

        std::string results;
        for (; it != end_it; ++it) {
            std::string name = (*it)[1];
            std::string args = (*it)[2];
            if (name == "FINAL") {
                std::cout << "Answer: " << args << "\n";
                return 0;
            }
            // 4. execute tool and collect result
            auto fn = tools.find(name);
            if (fn != tools.end())
                results += "[RESULT: " + fn->second(args) + "]\n";
        }
        // 5. inject results, loop again
        messages.push_back(tool_results_msg(results));
    }
}
Stage 4: Functions

Modern LLM APIs (OpenAI, Anthropic, Ollama with capable models) support
native tool use.
Instead of parsing free-form text, you declare your tools as JSON schemas.
The model returns a typed tool_call object — no regex,
no parsing fragility. The model was fine-tuned on this format, so reliability
is dramatically better than the label approach.
{
  "type": "function",
  "function": {
    "name": "search",
    "description": "Search the web for up-to-date information",
    "parameters": {
      "type": "object",
      "properties": {
        "query": {
          "type": "string",
          "description": "The search query"
        }
      },
      "required": ["query"]
    }
  }
}
// When finish_reason == "tool_calls" the response looks like:
// { "message": { "tool_calls": [
//     { "id": "call_xyz",
//       "function": { "name": "search",
//                     "arguments": "{\"query\":\"C++ cpr library\"}" }}]}}

int main() {
    json messages = build_system();
    messages.push_back(user_message(question));

    while (true) {
        json resp = chat_with_tools(messages, tools_schema);
        messages.push_back(resp["message"]);

        if (resp["finish_reason"] == "stop") {
            std::cout << resp["message"]["content"];
            break;
        }

        for (auto& tc : resp["message"]["tool_calls"]) {
            std::string fn = tc["function"]["name"];
            json args = json::parse(
                tc["function"]["arguments"].get<std::string>());
            std::string result = dispatch(fn, args);

            // inject result with role "tool"
            json tool_msg;
            tool_msg["role"] = "tool";
            tool_msg["tool_call_id"] = tc["id"];
            tool_msg["content"] = result;
            messages.push_back(tool_msg);
        }
    }
}
Stage 5: ReAct

ReAct (Reasoning + Acting, Yao et al. 2022) is the canonical pattern for autonomous agents. The model writes an explicit Thought before every Action. You inject an Observation. It loops until Final Answer. Making reasoning visible reduces errors — the model catches its own mistakes in the Thought step before acting.
Answer the question using this format. Repeat as needed:

Thought: <what you are currently thinking>
Action: <tool name>
Action Input: <input to the tool>
Observation: <you will receive the tool result here>

When you have enough information:

Thought: I now know the final answer.
Final Answer: <the complete answer>
struct ReActStep {
    std::string thought;
    std::string action;
    std::string action_input;
    std::string final_answer;
    bool is_final = false;
};

ReActStep parse_react(const std::string& text) {
    ReActStep s;
    std::smatch m;
    if (std::regex_search(text, m, std::regex(R"(Thought:\s*(.+?)(?:\n|$))")))
        s.thought = m[1];
    if (std::regex_search(text, m, std::regex(R"(Action:\s*(.+?)(?:\n|$))")))
        s.action = m[1];
    if (std::regex_search(text, m, std::regex(R"(Action Input:\s*(.+?)(?:\n|$))")))
        s.action_input = m[1];
    if (std::regex_search(text, m, std::regex(R"(Final Answer:\s*(.+))"))) {
        s.final_answer = m[1];
        s.is_final = true;
    }
    return s;
}

int main() {
    json messages = build_react_system();
    messages.push_back(user_message(question));

    for (int step = 0; step < 10; ++step) {  // always cap the loop!
        std::string reply = chat(messages);
        messages.push_back(assistant_msg(reply));

        ReActStep s = parse_react(reply);
        std::cerr << "[thought] " << s.thought << "\n";

        if (s.is_final) {
            std::cout << "Answer: " << s.final_answer << "\n";
            return 0;
        }
        std::string obs = run_tool(s.action, s.action_input);
        messages.push_back(observation_msg(obs));
    }
    std::cerr << "[warn] max steps reached\n";
}
Label Parsing, Concretely

Everything in Stage 3 hinges on the parser: scan the model's reply, extract every label, dispatch it. That is exactly what the C++ loop above does, over these labels:

[SEARCH: ...]
[CALC: ...]
[FILE: ...]
[RUN: ...]
[FINAL: ...]
The Cost of the Loop

Notice that even a simple tool-using question calls the LLM at least twice: once to decide which tool to use, and once to synthesise the tool's result into an answer. Each call costs tokens; that is the trade-off for explicit reasoning.
All Five Stages at a Glance
| Stage | Memory | Tools | Loop | Reasoning trace | Best for |
|---|---|---|---|---|---|
| ① Raw call | ✗ | ✗ | ✗ | ✗ | One-shot tasks |
| ② Loop | ✓ | ✗ | ✓ | ✗ | Chat interfaces |
| ③ Labels | ✓ | ~ | ✓ | ✗ | Any model, quick prototypes |
| ④ Functions | ✓ | ✓ | ✓ | ✗ | Production agents |
| ⑤ ReAct | ✓ | ✓ | ✓ | ✓ | Complex multi-step reasoning |
So, Is Attention All You Need?
My honest answer: for a
demo, yes. For anything
you’d run in production,
no. The loop, the tools,
and the guardrails are
what separate a toy
from a product.
For a neural network architecture? Apparently yes — the Transformer proved that.
For an AI product? No. Attention handles the hard part (understanding and generating language), but you also need:
- A loop to handle multi-step tasks
- Tools to interact with the real world
- Memory to maintain context across turns
- A step cap to prevent runaway costs and hallucination spirals
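The step cap is worth making explicit in code. A hypothetical Budget guard (the name and the characters-as-token-proxy heuristic are mine, not from any library) that bounds both step count and rough spend:

```cpp
#include <cstddef>
#include <string>

// Hypothetical guard: stop the agent loop after too many steps or too
// much (roughly estimated) token spend.
struct Budget {
    int max_steps;
    std::size_t max_chars;  // crude proxy for tokens
    int steps = 0;
    std::size_t chars = 0;

    // Call once per loop iteration; false means stop the agent.
    bool allow(const std::string& next_prompt) {
        ++steps;
        chars += next_prompt.size();
        return steps <= max_steps && chars <= max_chars;
    }
};
```

In the ReAct loop, `for (int step = 0; step < 10; ++step)` would become `while (budget.allow(reply))`, giving you one knob for both runaway loops and runaway bills.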
The journey from Stage 1 to Stage 5 is a journey from “calling a smart autocomplete” to “building infrastructure for autonomous decision-making.” Each stage adds one primitive — loop, tool dispatch, structured calls, explicit reasoning — and the emergent behaviour changes dramatically.
Start at Stage 1. Add a loop when you need memory. Use labels when your model has no native tool-calling. Graduate to function calls when reliability matters. Add explicit reasoning when you need to debug complex chains.
The model’s attention is the engine. Everything else is the car.
Further reading: ReAct paper (Yao et al. 2022), Ollama API docs, cpr HTTP library, nlohmann/json.