Manga Kotoba, Part 2: Scraping, AI Extraction, and the Vocabulary Engine
This is Part 2 of 5. If you missed Part 1, it covers the overall architecture and the iOS app design.
An app that shows vocabulary is useless without vocabulary data. That sounds obvious, but I didn’t fully appreciate it until I had a working Symfony API, a polished SwiftUI frontend, and absolutely nothing to put in the database. Part 2 is about solving that problem — and the three increasingly interesting approaches I took to do it.
The short version: I tried manual entry, I tried scraping, and I ended up building an AI extraction pipeline on top of Gemini’s vision API. Each approach taught me something, and the final system is a combination of all three working together.
The Data Problem
A typical manga volume has around 180 pages. Each page can have 20–40 vocabulary items. That’s ~5,000 words per volume — by hand, completely impossible.
The vocabulary that powers Manga Kotoba comes from annotated manga pages. Each page has a list of words with their readings and meanings. The iOS app surfaces those words while you read, lets you mark them as known, and tracks your progress across volumes.
Three approaches, in order of how naive I was being at the time:
- Manual entry — I actually tried this for about four hours. I got through twelve pages of one volume. At that rate it would take me approximately six months to import a single series. Hard no.
- Web scraping — there are a handful of sites that annotate manga vocabulary. If I could parse their HTML, I’d have data instantly. This worked, kind of, and I’ll show you the code.
- AI extraction from images — give Gemini a manga page image, get back a structured CSV. This turned out to be the real answer.
The Scraper Architecture
KotobaScraper uses Symfony’s HttpClient and DomCrawler. No Puppeteer, no headless browser — the source sites render server-side HTML, which helps.
Click any stage in the pipeline below to see the relevant code.
⚙️ Scraping pipeline
The Scraper Code
The scraper is a single KotobaScraper class that walks the site’s URL hierarchy: series → volumes → paginated pages → word rows. Commit d18ab71 has the full parser.
```php
// src/Scraper/KotobaScraper.php (simplified)
final class KotobaScraper
{
    private const BASE_URL = 'https://manga-kotoba.com';

    public function __construct(
        private readonly HttpClientInterface $httpClient,
    ) {}

    public function scrapeSeries(string $seriesUrl, int $maxVolumes = 0): SeriesData
    {
        $html = $this->fetch($seriesUrl);
        $crawler = new Crawler($html, $seriesUrl);

        $title = trim($crawler->filter('h1')->first()->text(''));

        $author = null;
        // Author link is inside #series-infobox, starts with /by/
        $crawler->filter('#series-infobox div > a')->each(
            function (Crawler $node) use (&$author) {
                if (str_starts_with($node->attr('href') ?? '', '/by/')) {
                    $author = trim($node->text(''));
                }
            }
        );

        $volumeSlugs = [];
        $crawler->filter('#series-volume-stats tbody tr td a')->each(
            function (Crawler $node) use (&$volumeSlugs) {
                $href = $node->attr('href') ?? '';
                if (str_contains($href, '/volume/')) {
                    $volumeSlugs[] = $this->slugFromUrl(self::BASE_URL . $href, 'volume');
                }
            }
        );

        if ($maxVolumes > 0) {
            $volumeSlugs = array_slice($volumeSlugs, 0, $maxVolumes);
        }

        $volumes = array_map(fn ($slug) => $this->scrapeVolume($slug), $volumeSlugs);

        return new SeriesData($title, slugify($title), $author, ..., $volumes);
    }
}
```
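The volume and page levels follow the same pattern. Here's a rough sketch of what scrapeVolume could look like: the selectors, pagination handling, and helpers (fetch, parseWordRow, VolumeData) are illustrative guesses, not the real parser from d18ab71.

```php
// Illustrative sketch of the volume walk; selectors and helper names are assumptions.
public function scrapeVolume(string $volumeSlug): VolumeData
{
    $words = [];
    $pageNum = 1;

    do {
        $html = $this->fetch(self::BASE_URL . '/volume/' . $volumeSlug . '?page=' . $pageNum);
        $crawler = new Crawler($html);

        // Each vocabulary table row becomes one WordData value object
        $crawler->filter('#volume-words tbody tr')->each(
            function (Crawler $row) use (&$words) {
                $words[] = $this->parseWordRow($row);
            }
        );

        // Keep walking while the pagination exposes a "next" link
        $hasNext = $crawler->filter('.pagination a[rel="next"]')->count() > 0;
        $pageNum++;
    } while ($hasNext);

    return new VolumeData($volumeSlug, $words);
}
```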
The duplicate-key bug (5a22fa8)
Doctrine’s identity map only tracks flushed entities. Unflushed new entities are invisible to find() — which is what bit me here.
This was a fun one. The scraper called findOrCreateVocabulary($sourceId) for each word. When the same word appeared on two different manga pages in a single scrape run, the second call to find() returned null — because the first entity had been staged in Doctrine’s unit of work but not yet flushed to the database. Two new Vocabulary objects with the same source_id ended up in the same flush, and the UNIQUE constraint exploded.
The fix was a local $vocabCache array keyed by sourceId, passed by reference through the persistence loop. Three-level lookup: in-memory cache first, then a findOneBy() that hits the identity map and the database, and only then creating a new entity.
```php
// Three-level lookup — no more duplicate-key errors
private function findOrCreate(WordData $data, array &$cache): Vocabulary
{
    // 1. in-memory cache (same scrape session)
    if (isset($cache[$data->sourceId])) {
        return $cache[$data->sourceId];
    }

    // 2. Doctrine identity map / DB
    $vocab = $this->vocabRepo->findOneBy(['sourceId' => $data->sourceId]);

    // 3. create new entity
    if ($vocab === null) {
        $vocab = Vocabulary::fromWordData($data);
        $this->em->persist($vocab);
    }

    return $cache[$data->sourceId] = $vocab;
}
```
The Pivot: AI Extraction with Gemini
Gemini’s vision API is surprisingly good at reading manga — even handwritten speech bubbles in messy fonts.
The scraper worked well for the one site that had clean, parseable HTML. The problem: every other vocabulary source had different markup, different pagination patterns, different CSS selectors. Five separate HTML parsers felt like a maintenance nightmare.
The real answer was simpler: manga pages are images. And Gemini 2.5 Flash can read images.
Instead of scraping vocabulary from annotated HTML, I could take the original manga page scans, send them to Gemini, and ask it to extract the vocabulary directly. The model sees the speech bubbles, the furigana, the sound effects — and returns a structured CSV.
Here’s the prompt that ended up in AiProcessMangaCommand.php after three iterations:
```php
// The final prompt (commit 653d10d)
$prompt = "Extract all Japanese vocabulary from this manga page."
    . " Use the dictionary form. Avoid particles and intensifiers."
    . " Format output as CSV (no header). Two row types:\n"
    . "\n"
    . " Metadata row: <PAGE_NUM>,caption,notes,,,\n"
    . " Vocabulary row: <PAGE_NUM>,,,word,reading,position\n"
    . "\n"
    . "Caption values: Cover, Index, Extra, Chapter 1 ...\n"
    . "Only return CSV lines. No markdown. Dictionary form always.";
```
And an example of the output Gemini produces:
```csv
# 6-column CSV output from Gemini
3,Chapter 1,Weather and everyday expressions,,,
3,,,湿気,しっけ,1
3,,,蒸し暑い,むしあつい,2
3,,,大丈夫,だいじょうぶ,3
4,Chapter 1,,,,
4,,,外,そと,1
4,,,遊ぶ,あそぶ,2
```
The command walks a folder of numbered images (1.jpg, 2.jpg, …), processes each through Gemini’s generateContent endpoint with the image base64-encoded, and streams the results into a single import.csv:
```php
$payload = [
    'contents' => [[
        'parts' => [
            ['text' => $prompt],
            [
                'inline_data' => [
                    'mime_type' => $mimeType,
                    'data' => base64_encode(file_get_contents($imagePath)),
                ],
            ],
        ],
    ]],
];

$response = $this->httpClient->request('POST', self::GEMINI_API_URL, [
    'query' => ['key' => $this->googleApiKey],
    'json' => $payload,
    'timeout' => 60,
]);

$text = $response->toArray()['candidates'][0]['content']['parts'][0]['text'];
```
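For context, the loop around that request is straightforward. Here's a minimal sketch of how the command might walk the folder and stream rows into import.csv; the extractPage() helper (wrapping the request above) and the sorting details are assumptions, not the actual AiProcessMangaCommand code.

```php
// Hypothetical outer loop; the real one lives in AiProcessMangaCommand.php.
// extractPage() is assumed to wrap the Gemini request shown above and return raw CSV text.
$out = fopen($outputDir . '/import.csv', 'wb');

// Numeric sort so 2.jpg comes before 10.jpg
$images = glob($inputDir . '/*.jpg');
usort($images, fn (string $a, string $b) =>
    (int) basename($a, '.jpg') <=> (int) basename($b, '.jpg'));

foreach ($images as $imagePath) {
    $csv = $this->extractPage($imagePath);    // Gemini call from the snippet above

    fwrite($out, trim($csv) . "\n");          // append this page's rows to import.csv
}

fclose($out);
```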
The CSV Import Pipeline
CsvImportService is the bridge between the AI-generated CSV and the Doctrine entities. It auto-detects 4-column vs 6-column format.
Once you have the CSV, you import it into a volume through the admin panel. The demo below simulates what the import process looks like for a small file. Hit Import and watch each row get processed.
📂 CSV import simulator
The CsvImportService in PHP mirrors exactly what the demo shows. It reads the file, validates each row, checks for duplicates, and persists via Doctrine:
```php
// src/Service/CsvImportService.php — validation loop (simplified)
while (($data = fgetcsv($csvHandle, null, ',', '"', '\\')) !== false) {
    $line++;

    // Skip header if present
    if ($line === 1 && strtolower((string) ($data[0] ?? '')) === 'page') {
        continue;
    }

    // Auto-detect 6-col vs 4-col format
    $is6Col  = count($data) >= 6;
    $pageNum = (int) $data[0];
    $text    = $is6Col ? ($data[3] ?? '') : ($data[1] ?? '');
    $reading = $is6Col ? ($data[4] ?? '') : ($data[2] ?? '');

    // Metadata rows have empty word column
    if (trim($text) === '') {
        $this->handleMetadataRow($data, $volume);
        continue;
    }

    // Parentheses around the null-coalesce so missing position columns default to 0
    $rows[] = [$pageNum, $text, $reading, (int) ($data[5] ?? 0), $line];
    $neededTexts[] = $text;
    $neededPageNums[] = $pageNum;
}
```
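The metadata rows (page number, caption, notes, and an empty word column) are what set chapter captions on pages. handleMetadataRow() is only referenced above, so here's a rough reconstruction; the Page setters and the findOrCreatePage() helper are assumptions.

```php
// Sketch of the metadata-row handler; the real implementation in
// CsvImportService and the exact Page API (setCaption / setNotes) may differ.
private function handleMetadataRow(array $data, Volume $volume): void
{
    $pageNum = (int) $data[0];
    $caption = trim((string) ($data[1] ?? ''));   // e.g. "Chapter 1", "Cover"
    $notes   = trim((string) ($data[2] ?? ''));

    $page = $this->findOrCreatePage($volume, $pageNum);

    if ($caption !== '') {
        $page->setCaption($caption);
    }
    if ($notes !== '') {
        $page->setNotes($notes);
    }
}
```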
The Batch-Loading Fix (792920c)
Batch flushing is Doctrine 101, but it’s the kind of thing you only learn after watching a 200-row import take 45 seconds and timing out on Heroku.
The original import worked, but it was slow. Embarrassingly slow. Each row triggered its own findVocabulary() DB query and its own $em->flush(). For a 1,000-row CSV that meant ~2,000 SQL queries. On Heroku, that reliably hit the H12 30-second timeout.
The fix was a 4-pass batch approach:
- Pass 1 — parse the entire CSV into memory (no DB)
- Pass 2 — create all missing pages in one flush
- Pass 3 — batch-load all vocabulary with a single SELECT ... WHERE text IN (...)
- Pass 4 — assemble INSERT tuples and fire one INSERT IGNORE
⏱ Import performance: before vs after
The key change — going from N individual flushes to a single batched INSERT:
```php
// BEFORE — one flush per row
foreach ($rows as $row) {
    $vocab = $this->findVocabulary($row['text'], $row['reading']); // 1 SELECT each
    $this->em->persist($vocab);
    $this->em->flush();              // 1 INSERT each — 1000 rows = 2000 queries
}

// AFTER — batch select + single INSERT
$vocabMap = $this->batchLoadVocabulary($uniqueTexts);              // 1 SELECT
$tuples = [];

foreach ($rows as $row) {
    $vocabId = $vocabMap[$row['text']]?->getId();
    $pageId  = $pages[$row['page']];
    $tuples[] = '(' . $pageId . ',' . $vocabId . ',' . $row['pos'] . ')';
}

// One INSERT IGNORE for the whole file
$this->em->getConnection()->executeStatement(
    'INSERT IGNORE INTO page_vocabulary (page_id, vocabulary_id, position) VALUES '
    . implode(',', $tuples)
);
```
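Pass 3 is the interesting one: batchLoadVocabulary() turns N lookups into a single IN() query keyed by text. The method name comes from the snippet above, but this body (and the indexBy trick) is my sketch of how it could work, not the actual service code.

```php
/**
 * Sketch of the Pass 3 batch loader: one SELECT for every unique text in the file.
 *
 * @return array<string, Vocabulary> map of text => entity
 */
private function batchLoadVocabulary(array $uniqueTexts): array
{
    if ($uniqueTexts === []) {
        return [];
    }

    return $this->em->createQueryBuilder()
        ->select('v')
        ->from(Vocabulary::class, 'v', 'v.text')   // index results by text for O(1) lookups
        ->where('v.text IN (:texts)')
        ->setParameter('texts', $uniqueTexts)
        ->getQuery()
        ->getResult();
}
```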
Vocabulary Meaning Entities (e5970a3)
A word like 大丈夫 has multiple senses: “OK”, “safe”, “certainly”. Collapsing them into a single text column made filtering by part-of-speech impossible.
The original schema had a single meaning TEXT column on both Vocabulary and Kanji. That worked fine until I needed to filter vocabulary by part of speech, or display individual senses in the iOS app. Commit e5970a3 replaced both columns with proper one-to-many relationships — VocabularyMeaning and KanjiMeaning entities with a position field for ordering.
🗂 Entity relationships
A backward-compatible getMeaning() helper on Vocabulary returns the first meaning from the collection, so callers that only need the primary definition didn’t need to change:
```php
// Backward-compatible helper
public function getMeaning(): string
{
    // Collection::first() returns false (not null) when the collection is empty
    $first = $this->meanings->first();

    return $first instanceof VocabularyMeaning ? $first->getDefinition() : '';
}

// New: get all meanings for the iOS detail view
public function getMeanings(): Collection
{
    return $this->meanings->matching(
        Criteria::create()->orderBy(['position' => 'ASC'])
    );
}
```
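For a sense of the new shape, here's roughly what the meaning entity might look like. The source only names the entity, its relation to Vocabulary, and the position field; everything else below is an assumption (KanjiMeaning would follow the same pattern).

```php
// Sketch of the VocabularyMeaning entity; column and property names beyond
// "definition" and "position" are assumptions.
#[ORM\Entity]
class VocabularyMeaning
{
    #[ORM\Id]
    #[ORM\GeneratedValue]
    #[ORM\Column]
    private ?int $id = null;

    #[ORM\ManyToOne(targetEntity: Vocabulary::class, inversedBy: 'meanings')]
    #[ORM\JoinColumn(nullable: false)]
    private Vocabulary $vocabulary;

    #[ORM\Column(type: 'text')]
    private string $definition;

    // Used by getMeanings() to keep senses in dictionary order
    #[ORM\Column(type: 'integer')]
    private int $position = 0;

    public function getDefinition(): string
    {
        return $this->definition;
    }

    public function getPosition(): int
    {
        return $this->position;
    }
}
```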
The Admin Panel
The admin panel is server-rendered Twig with a full dark-mode redesign from b5161b8. No React, no Vue — just Symfony and a healthy respect for HTML forms.
Commit e904c16 added per-page vocabulary management to the admin: view all words on a page, reorder them, add new entries, set captions. The mockup below gives you an interactive feel for how it works. Click any row to expand the edit form.
| # | Word | Meaning | Page |
|---|---|---|---|
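For a feel of the plumbing behind that mockup, here's a minimal sketch of what a reorder action could look like. The route, request shape, and entity accessors are all assumptions rather than the actual controller from e904c16.

```php
// Hypothetical reorder action (inside an AbstractController); illustrative only.
// Assumes a PageVocabulary join entity that carries the position column.
#[Route('/admin/page/{id}/vocabulary/reorder', methods: ['POST'])]
public function reorder(Page $page, Request $request, EntityManagerInterface $em): JsonResponse
{
    // Expected payload: {"order": [pageVocabularyId, pageVocabularyId, ...]}
    $order = $request->toArray()['order'] ?? [];

    foreach ($page->getPageVocabularies() as $pv) {
        $position = array_search($pv->getId(), $order, true);
        if ($position !== false) {
            $pv->setPosition($position);
        }
    }

    $em->flush();

    return $this->json(['status' => 'ok']);
}
```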
Excluding Known Words from Frequency (7324008)
The SQL subquery that excludes known words is deceptively simple, but it’s the feature users noticed most — “it only shows me words I don’t already know!”
The GET /api/vocabulary/common endpoint returns the most frequent vocabulary items across a manga series — but for logged-in users, it should skip words they already know. That’s a one-line SQL addition that makes the whole product feel personalised:
```sql
-- Core query for common vocabulary, unknown to this user
SELECT v.id,
       v.text,
       v.reading,
       COUNT(DISTINCT pv.page_id) AS occurrences
FROM vocabulary v
JOIN page_vocabulary pv ON pv.vocabulary_id = v.id
JOIN page p             ON p.id = pv.page_id
JOIN volume vol         ON vol.id = p.volume_id
WHERE vol.manga_id = :mangaId
  AND v.id NOT IN (
      SELECT kw.vocabulary_id
      FROM known_word kw
      WHERE kw.user_id = :userId   -- the magic line
  )
GROUP BY v.id
ORDER BY occurrences DESC
LIMIT 50;
```
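On the PHP side this reads most naturally as a raw DBAL query rather than DQL, since the subquery and GROUP BY map one-to-one. A sketch of what the repository method might look like; the method name and wiring are assumptions, not the actual endpoint code.

```php
// Sketch only; the real repository/endpoint may be structured differently.
public function findCommonUnknown(int $mangaId, int $userId, int $limit = 50): array
{
    $sql = '
        SELECT v.id, v.text, v.reading, COUNT(DISTINCT pv.page_id) AS occurrences
        FROM vocabulary v
        JOIN page_vocabulary pv ON pv.vocabulary_id = v.id
        JOIN page p ON p.id = pv.page_id
        JOIN volume vol ON vol.id = p.volume_id
        WHERE vol.manga_id = :mangaId
          AND v.id NOT IN (
              SELECT kw.vocabulary_id FROM known_word kw WHERE kw.user_id = :userId
          )
        GROUP BY v.id
        ORDER BY occurrences DESC
        LIMIT ' . (int) $limit;   // LIMIT inlined as an int, not bound

    return $this->getEntityManager()->getConnection()->fetchAllAssociative(
        $sql,
        ['mangaId' => $mangaId, 'userId' => $userId]
    );
}
```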
The interactive chart below shows vocabulary frequency across a mock manga series. Toggle between “Show all” and “Exclude known” to see how the known words disappear.
📊 Word frequency across manga pages
What’s Next
Part 3 is where things got painful in a very instructive way. The N+1 query pattern shows up when you’re not looking, and the fix changes how you think about ORMs.
The vocabulary engine works. You can scrape a series, run a folder of images through Gemini, import the resulting CSV, and have a fully populated database in minutes. The admin panel lets you review and fix anything the AI got wrong.
But as soon as I added real users and real reading sessions, the API started slowing down. Pages that should load in 100ms were taking 2–3 seconds. The culprit was the N+1 query pattern — hiding in places I didn’t expect.
Part 3 is about finding every N+1, fixing them with DQL joins and EXTRA_LAZY fetch modes, and what I learned about Doctrine’s behaviour that I wish someone had told me earlier.
See you in the next one.