Is Attention All You Need?

Named after the landmark 2017 paper “Attention Is All You Need” by Vaswani et al. — the paper that introduced Transformers and changed everything.

In 2017, eight Google researchers published a paper that changed everything. “Attention Is All You Need” gave us the Transformer — the architecture behind every LLM you have talked to this year. For the neural network itself, attention really is enough.

For you, the developer building with these models? That is a very different question.

A raw LLM is a text-in, text-out function. No memory, no tools, no way to act on the world. To build something useful — something that can search, compute, read files, and loop until it finds an answer — you need scaffolding. That scaffolding is what we call an agent.

This post walks through exactly that construction, from a single HTTP call to a proper autonomous loop, using Ollama and C++ (the king of all languages). The same patterns apply in any language.


Setting Up

Local vs cloud: local models are free, private, and offline. Cloud models (Claude, GPT) are smarter but cost money and send your data to a server. For learning, go local.

First, install Ollama and pull a model:

brew install ollama
ollama serve                  # serves the REST API on localhost:11434
ollama pull deepseek-r1:8b

For C++ we need two libraries: cpr for HTTP and nlohmann/json for JSON. Both are easily fetched via CMake’s FetchContent or vcpkg:

# CMakeLists.txt excerpt
find_package(cpr REQUIRED)
find_package(nlohmann_json REQUIRED)
target_link_libraries(agent PRIVATE cpr::cpr nlohmann_json::nlohmann_json)
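
If you would rather vendor the dependencies than install them, here is a FetchContent sketch (the GIT_TAG values are examples; pin whichever releases you actually test against):

```cmake
# CMakeLists.txt excerpt — FetchContent variant
include(FetchContent)
FetchContent_Declare(cpr
  GIT_REPOSITORY https://github.com/libcpr/cpr.git
  GIT_TAG        1.10.5)      # example tag; pin a release you have tested
FetchContent_Declare(json
  GIT_REPOSITORY https://github.com/nlohmann/json.git
  GIT_TAG        v3.11.3)     # example tag
FetchContent_MakeAvailable(cpr json)
target_link_libraries(agent PRIVATE cpr::cpr nlohmann_json::nlohmann_json)
```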

What Is an Agent?

At its core, an agent is: a loop where an LLM decides what to do next.

Simple call — no agent:
  👤 User prompt → 🤖 LLM → 💬 Response

Agent loop:
  👤 User prompt → 🤖 LLM thinks → 🔍 Parse output → 🔧 Run tool → ↻ (loop until done) → ✅ Final answer

The model decides what to do. Your code executes it and feeds the result back. Simple contract. Profound consequences.


The Five Stages

The five stages, from simplest to most capable:

The Single Call

Send a prompt, get a response, done. No loop, no memory, no tools. One prompt → one reply. Perfect for self-contained, one-shot tasks where the model needs no external data to answer correctly.

C++ — single Ollama call
#include <cpr/cpr.h>
#include <nlohmann/json.hpp>
#include <iostream>
using json = nlohmann::json;

std::string ask(const std::string& prompt) {
    json body;
    body["model"]  = "deepseek-r1:8b";
    body["prompt"] = prompt;
    body["stream"] = false;

    auto r = cpr::Post(
        cpr::Url{"http://localhost:11434/api/generate"},
        cpr::Header{ {"Content-Type", "application/json"} },
        cpr::Body{body.dump()}
    );
    return json::parse(r.text)["response"];
}

int main() {
    std::cout << ask("What does inakamono mean?") << "\n";
    return 0;
}
✓ Good for
  • Summarization
  • One-shot code generation
  • Translation and reformatting
  • Single factual questions
✗ Cannot do
  • Remember earlier messages
  • Look up real-time information
  • Take multi-step actions
  • Recover from wrong guesses
The Conversation Loop

Add a while loop and a message array. Each turn you append the user message, call the model, and append its reply. The model appears to "remember" earlier turns — but really you are just sending its own past responses back to it on every request. This is how every ChatGPT-style interface works.

C++ — chat loop with history
int main() {
    json messages = json::array();   // growing conversation history

    while (true) {
        std::cout << "You: ";
        std::string input;
        std::getline(std::cin, input);
        if (input == "/quit") break;

        // 1. append user message
        json user_msg;
        user_msg["role"]    = "user";
        user_msg["content"] = input;
        messages.push_back(user_msg);

        // 2. call /api/chat (not /api/generate)
        json body;
        body["model"]    = "deepseek-r1:8b";
        body["messages"] = messages;
        body["stream"]   = false;

        auto r = cpr::Post(
            cpr::Url{"http://localhost:11434/api/chat"},
            cpr::Header{ {"Content-Type", "application/json"} },
            cpr::Body{body.dump()}
        );

        std::string reply =
            json::parse(r.text)["message"]["content"];

        // 3. append assistant reply, loop
        json asst;
        asst["role"]    = "assistant";
        asst["content"] = reply;
        messages.push_back(asst);

        std::cout << "Bot: " << reply << "\n";
    }
    return 0;
}
Token cost trap: Every request sends the entire history. A 20-turn conversation costs roughly 20× the tokens of the last message alone. For long sessions consider a sliding window or a periodic summarization step.
Labels as Primitive Tools

Before OpenAI shipped function calling, this was the trick everyone used. You teach the model — in its system prompt — to output special labelled tags when it wants to do something: [SEARCH: rust lifetimes], [CALC: 42 * 1.5]. Your code scans every response for these labels, executes the matching action, and injects the result back as a new message. The loop continues until the model outputs [FINAL: ...].

It is fragile if the model drifts from the format, but it works on any model — even ones with no native tool-calling support.

System prompt — define the label vocabulary
You are a helpful assistant with access to tools.

To use a tool, output its label on its own line:

  [SEARCH: <query>]        -- search the web for information
  [CALC: <expression>]     -- evaluate a math expression
  [FILE: <path>]           -- read a file from disk
  [RUN: <shell command>]   -- execute a shell command

After each tool call you will receive its output as:
  [RESULT: <output>]

When you have a complete answer, output:
  [FINAL: <your answer here>]

Never guess. Use tools for anything uncertain.
C++ — agent loop with label parsing
#include <regex>
#include <functional>
#include <unordered_map>

using ToolFn = std::function<std::string(const std::string&)>;

int main() {
    // 1. register tools
    std::unordered_map<std::string, ToolFn> tools;
    tools["SEARCH"] = [](const std::string& q)    { return web_search(q);   };
    tools["CALC"]   = [](const std::string& expr) { return evaluate(expr);  };
    tools["FILE"]   = [](const std::string& path) { return read_file(path); };
    tools["RUN"]    = [](const std::string& cmd)  { return shell_exec(cmd); };

    // 2. label regex: matches [NAME: args]
    std::regex label_re(R"(\[([A-Z_]+):\s*([^\]]+)\])");

    json messages = build_system_messages();
    messages.push_back(user_message(initial_question));

    for (int step = 0; step < 15; ++step) {
        std::string reply = chat(messages);
        messages.push_back(assistant_msg(reply));

        // 3. scan response for labels
        std::sregex_iterator it(reply.begin(), reply.end(), label_re);
        std::sregex_iterator end_it;

        if (it == end_it) {
            std::cout << reply << "\n";
            break;
        }

        std::string results;
        for (; it != end_it; ++it) {
            std::string name = (*it)[1];
            std::string args = (*it)[2];

            if (name == "FINAL") {
                std::cout << "Answer: " << args << "\n";
                return 0;
            }
            // 4. execute tool and collect result
            auto fn = tools.find(name);
            if (fn != tools.end())
                results += "[RESULT: " + fn->second(args) + "]\n";
        }
        // 5. inject results, loop again
        messages.push_back(tool_results_msg(results));
    }
}
Why it works: Models are trained on enormous amounts of structured text. Once you show them the label pattern in the system prompt, they reliably output labels in the correct format. You are exploiting instruction-following, not any special API feature.
Structured Function Calling

Modern LLM APIs (OpenAI, Anthropic, Ollama with capable models) support native tool use. Instead of parsing free-form text, you declare your tools as JSON schemas. The model returns a typed tool_call object — no regex, no parsing fragility. The model was fine-tuned on this format, so reliability is dramatically better than the label approach.

JSON — tool schema
{
  "type": "function",
  "function": {
    "name": "search",
    "description": "Search the web for up-to-date information",
    "parameters": {
      "type": "object",
      "properties": {
        "query": {
          "type": "string",
          "description": "The search query"
        }
      },
      "required": ["query"]
    }
  }
}
C++ — dispatching a tool_call response
// When finish_reason == "tool_calls" the response looks like:
// { "message": { "tool_calls": [
//   { "id": "call_xyz",
//     "function": { "name": "search",
//                   "arguments": "{\"query\":\"C++ cpr library\"}" }}]}}

int main() {
    json messages = build_system();
    messages.push_back(user_message(question));

    while (true) {
        json resp = chat_with_tools(messages, tools_schema);
        messages.push_back(resp["message"]);

        if (resp["finish_reason"] == "stop") {
            std::cout << resp["message"]["content"];
            break;
        }

        for (auto& tc : resp["message"]["tool_calls"]) {
            std::string fn   = tc["function"]["name"];
            json         args = json::parse(
                tc["function"]["arguments"].get<std::string>()
            );

            std::string result = dispatch(fn, args);

            // inject result with role "tool"
            json tool_msg;
            tool_msg["role"]         = "tool";
            tool_msg["tool_call_id"] = tc["id"];
            tool_msg["content"]      = result;
            messages.push_back(tool_msg);
        }
    }
}
The Full ReAct Loop

ReAct (Reasoning + Acting, Yao et al. 2022) is the canonical pattern for autonomous agents. The model writes an explicit Thought before every Action. You inject an Observation. It loops until Final Answer. Making reasoning visible reduces errors — the model catches its own mistakes in the Thought step before acting.

System prompt — ReAct format
Answer the question using this format. Repeat as needed:

Thought: <what you are currently thinking>
Action: <tool name>
Action Input: <input to the tool>
Observation: <you will receive the tool result here>

When you have enough information:
Thought: I now know the final answer.
Final Answer: <the complete answer>
C++ — ReAct parsing and loop
struct ReActStep {
    std::string thought;
    std::string action;
    std::string action_input;
    std::string final_answer;
    bool        is_final = false;
};

ReActStep parse_react(const std::string& text) {
    ReActStep s;
    std::smatch m;
    if (std::regex_search(text, m, std::regex(R"(Thought:\s*(.+?)(?:\n|$))")))
        s.thought = m[1];
    if (std::regex_search(text, m, std::regex(R"(Action:\s*(.+?)(?:\n|$))")))
        s.action = m[1];
    if (std::regex_search(text, m, std::regex(R"(Action Input:\s*(.+?)(?:\n|$))")))
        s.action_input = m[1];
    if (std::regex_search(text, m, std::regex(R"(Final Answer:\s*(.+))"))) {
        s.final_answer = m[1];
        s.is_final = true;
    }
    return s;
}

int main() {
    json messages = build_react_system();
    messages.push_back(user_message(question));

    for (int step = 0; step < 10; ++step) {   // always cap the loop!
        std::string reply = chat(messages);
        messages.push_back(assistant_msg(reply));

        ReActStep s = parse_react(reply);
        std::cerr << "[thought] " << s.thought << "\n";

        if (s.is_final) {
            std::cout << "Answer: " << s.final_answer << "\n";
            return 0;
        }

        std::string obs = run_tool(s.action, s.action_input);
        messages.push_back(observation_msg(obs));
    }
    std::cerr << "[warn] max steps reached\n";
}
Always cap the loop. Without a step limit a confused model can loop indefinitely. At $15 / MTok for premium models that is an expensive mistake. Ten iterations is a reasonable default; raise it only for known deep-research tasks.


The ReAct Loop, Step by Step

Notice the LLM is called twice for this question: once to decide the tool, once to synthesise the result. Each call costs tokens — that is the trade-off for explicit reasoning.

  👤 User question received
  🤖 LLM — first thought
     Thought: I need to check the file size. I will run a shell command.
     Action: RUN
     Action Input: stat -f%z /etc/hosts
  🔧 Tool executed
     $ stat -f%z /etc/hosts → 213
  👁 Observation injected into context
     Observation: 213
  🤖 LLM — final reasoning
     Thought: I have the size. I can now answer.
     Final Answer: The file /etc/hosts is 213 bytes.
  ✅ Final answer returned to user
     The file /etc/hosts is 213 bytes.

All Five Stages at a Glance

Stage          Memory   Tools   Loop   Reasoning trace   Best for
① Raw call       ✗        ✗      ✗           ✗           One-shot tasks
② Loop           ✓        ✗      ✓           ✗           Chat interfaces
③ Labels         ✓        ~      ✓           ✗           Any model, quick prototypes
④ Functions      ✓        ✓      ✓           ✗           Production agents
⑤ ReAct          ✓        ✓      ✓           ✓           Complex multi-step reasoning

So, Is Attention All You Need?

My honest answer: for a demo, yes. For anything you’d run in production, no. The loop, the tools, and the guardrails are what separate a toy from a product.

For a neural network architecture? Apparently yes — the Transformer proved that.

For an AI product? No. Attention handles the hard part (understanding and generating language), but you also need:

  • A loop to handle multi-step tasks
  • Tools to interact with the real world
  • Memory to maintain context across turns
  • A step cap to prevent runaway costs and hallucination spirals

The journey from Stage 1 to Stage 5 is a journey from “calling a smart autocomplete” to “building infrastructure for autonomous decision-making.” Each stage adds one primitive — loop, tool dispatch, structured calls, explicit reasoning — and the emergent behaviour changes dramatically.

Start at Stage 1. Add a loop when you need memory. Use labels when your model has no native tool-calling. Graduate to function calls when reliability matters. Add explicit reasoning when you need to debug complex chains.

The model’s attention is the engine. Everything else is the car.


Further reading: ReAct paper (Yao et al. 2022), Ollama API docs, cpr HTTP library, nlohmann/json.