Manga Kotoba, Part 2: Scraping, AI Extraction, and the Vocabulary Engine
This is Part 2 of 5. If you missed Part 1, it covers the overall architecture and the iOS app design.
An app that shows vocabulary is useless without vocabulary data. That sounds obvious, but I didn’t fully appreciate it until I had a working Symfony API, a polished SwiftUI frontend, and absolutely nothing to put in the database. Part 2 is about solving that problem — and the three increasingly interesting approaches I took to do it.
The short version: I tried manual entry, I tried scraping, and I ended up building an AI extraction pipeline on top of Gemini’s vision API. Each approach taught me something, and the final system is a combination of all three working together.
The Data Problem
A typical manga volume has around 180 pages. Each page can have 20–40 vocabulary items. That’s ~5,000 words per volume — by hand, completely impossible.
The vocabulary that powers Manga Kotoba comes from annotated manga pages. Each page has a list of words with their readings and meanings. The iOS app surfaces those words while you read, lets you mark them as known, and tracks your progress across volumes.
Three approaches, in order of how naive I was being at the time:
- Manual entry — I actually tried this for about four hours. I got through twelve pages of one volume. At that rate it would take me approximately six months to import a single series. Hard no.
- Web scraping — there are a handful of sites that annotate manga vocabulary. If I could parse their HTML, I’d have data instantly. This worked, kind of, and I’ll show you the code.
- AI extraction from images — give Gemini a manga page image, get back a structured CSV. This turned out to be the real answer.
The Scraper Architecture
KotobaScraper uses Symfony’s HttpClient and DomCrawler. No Puppeteer, no headless browser — the source sites render server-side HTML, which helps.
Click any stage in the pipeline below to see the relevant code.
⚙️ Scraping pipeline
The Scraper Code
The scraper is a single KotobaScraper class that walks the site’s URL hierarchy: series → volumes → paginated pages → word rows. Commit d18ab71 has the full parser.
```php
// src/Scraper/KotobaScraper.php (simplified)
final class KotobaScraper
{
    private const BASE_URL = 'https://manga-kotoba.com';

    public function __construct(
        private readonly HttpClientInterface $httpClient,
    ) {}

    public function scrapeSeries(string $seriesUrl, int $maxVolumes = 0): SeriesData
    {
        $html = $this->fetch($seriesUrl);
        $crawler = new Crawler($html, $seriesUrl);

        $title = trim($crawler->filter('h1')->first()->text(''));

        $author = null;
        // Author link is inside #series-infobox, starts with /by/
        $crawler->filter('#series-infobox div > a')->each(
            function (Crawler $node) use (&$author) {
                if (str_starts_with($node->attr('href') ?? '', '/by/')) {
                    $author = trim($node->text(''));
                }
            }
        );

        $volumeSlugs = [];
        $crawler->filter('#series-volume-stats tbody tr td a')->each(
            function (Crawler $node) use (&$volumeSlugs) {
                $href = $node->attr('href') ?? '';
                if (str_contains($href, '/volume/')) {
                    $volumeSlugs[] = $this->slugFromUrl(self::BASE_URL . $href, 'volume');
                }
            }
        );

        if ($maxVolumes > 0) {
            $volumeSlugs = array_slice($volumeSlugs, 0, $maxVolumes);
        }

        $volumes = array_map(fn ($slug) => $this->scrapeVolume($slug), $volumeSlugs);

        return new SeriesData($title, slugify($title), $author, ..., $volumes);
    }
}
```
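The volume and page levels follow the same pattern. Here's a rough sketch of what scrapeVolume could look like: the selectors, pagination handling, and helpers (fetch, parseWordRow, VolumeData) are illustrative guesses, not the real parser from d18ab71.

```php
// Illustrative sketch of the volume walk; selectors and helper names are assumptions.
public function scrapeVolume(string $volumeSlug): VolumeData
{
    $words = [];
    $pageNum = 1;

    do {
        $html = $this->fetch(self::BASE_URL . '/volume/' . $volumeSlug . '?page=' . $pageNum);
        $crawler = new Crawler($html);

        // Each vocabulary table row becomes one WordData value object
        $crawler->filter('#volume-words tbody tr')->each(
            function (Crawler $row) use (&$words) {
                $words[] = $this->parseWordRow($row);
            }
        );

        // Keep walking while the pagination exposes a "next" link
        $hasNext = $crawler->filter('.pagination a[rel="next"]')->count() > 0;
        $pageNum++;
    } while ($hasNext);

    return new VolumeData($volumeSlug, $words);
}
```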
The duplicate-key bug (5a22fa8)
Doctrine’s identity map only tracks flushed entities. Unflushed new entities are invisible to find() — which is what bit me here.
This was a fun one. The scraper called findOrCreateVocabulary($sourceId) for each word. When the same word appeared on two different manga pages in a single scrape run, the second call to find() returned null — because the first entity had been staged in Doctrine’s unit of work but not yet flushed to the database. Two new Vocabulary objects with the same source_id ended up in the same flush, and the UNIQUE constraint exploded.
The fix was a local $vocabCache array keyed by sourceId, passed by reference through the persistence loop. Three-level lookup: in-memory cache first, then a findOneBy() that hits the identity map and the database, and only then creating a new entity.
```php
// Three-level lookup — no more duplicate-key errors
private function findOrCreate(WordData $data, array &$cache): Vocabulary
{
    // 1. in-memory cache (same scrape session)
    if (isset($cache[$data->sourceId])) {
        return $cache[$data->sourceId];
    }

    // 2. Doctrine identity map / DB
    $vocab = $this->vocabRepo->findOneBy(['sourceId' => $data->sourceId]);

    // 3. create new entity
    if ($vocab === null) {
        $vocab = Vocabulary::fromWordData($data);
        $this->em->persist($vocab);
    }

    return $cache[$data->sourceId] = $vocab;
}
```
The Pivot: AI Extraction with Gemini
Gemini’s vision API is surprisingly good at reading manga — even handwritten speech bubbles in messy fonts.
The scraper worked well for the one site that had clean, parseable HTML. The problem: every other vocabulary source had different markup, different pagination patterns, different CSS selectors. Five separate HTML parsers felt like a maintenance nightmare.
The real answer was simpler: manga pages are images. And Gemini 2.5 Flash can read images.
Instead of scraping vocabulary from annotated HTML, I could take the original manga page scans, send them to Gemini, and ask it to extract the vocabulary directly. The model sees the speech bubbles, the furigana, the sound effects — and returns a structured CSV.
Here’s the prompt that ended up in AiProcessMangaCommand.php after three iterations:
```php
// The final prompt (commit 653d10d)
$prompt = "Extract all Japanese vocabulary from this manga page."
    . " Use the dictionary form. Avoid particles and intensifiers."
    . " Format output as CSV (no header). Two row types:\n"
    . "\n"
    . " Metadata row: <PAGE_NUM>,caption,notes,,,\n"
    . " Vocabulary row: <PAGE_NUM>,,,word,reading,position\n"
    . "\n"
    . "Caption values: Cover, Index, Extra, Chapter 1 ...\n"
    . "Only return CSV lines. No markdown. Dictionary form always.";
```
And an example of the output Gemini produces:
```csv
# 6-column CSV output from Gemini
3,Chapter 1,Weather and everyday expressions,,,
3,,,湿気,しっけ,1
3,,,蒸し暑い,むしあつい,2
3,,,大丈夫,だいじょうぶ,3
4,Chapter 1,,,,
4,,,外,そと,1
4,,,遊ぶ,あそぶ,2
```
The command walks a folder of numbered images (1.jpg, 2.jpg, …), processes each through Gemini’s generateContent endpoint with the image base64-encoded, and streams the results into a single import.csv:
```php
$payload = [
    'contents' => [[
        'parts' => [
            ['text' => $prompt],
            [
                'inline_data' => [
                    'mime_type' => $mimeType,
                    'data' => base64_encode(file_get_contents($imagePath)),
                ],
            ],
        ],
    ]],
];

$response = $this->httpClient->request('POST', self::GEMINI_API_URL, [
    'query' => ['key' => $this->googleApiKey],
    'json' => $payload,
    'timeout' => 60,
]);

$text = $response->toArray()['candidates'][0]['content']['parts'][0]['text'];
```
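For context, the loop around that request is straightforward. Here's a minimal sketch of how the command might walk the folder and stream rows into import.csv; the extractPage() helper (wrapping the request above) and the sorting details are assumptions, not the actual AiProcessMangaCommand code.

```php
// Hypothetical outer loop; the real one lives in AiProcessMangaCommand.php.
// extractPage() is assumed to wrap the Gemini request shown above and return raw CSV text.
$out = fopen($outputDir . '/import.csv', 'wb');

// Numeric sort so 2.jpg comes before 10.jpg
$images = glob($inputDir . '/*.jpg');
usort($images, fn (string $a, string $b) =>
    (int) basename($a, '.jpg') <=> (int) basename($b, '.jpg'));

foreach ($images as $imagePath) {
    $csv = $this->extractPage($imagePath);    // Gemini call from the snippet above

    fwrite($out, trim($csv) . "\n");          // append this page's rows to import.csv
}

fclose($out);
```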
The CSV Import Pipeline
CsvImportService is the bridge between the AI-generated CSV and the Doctrine entities. It auto-detects 4-column vs 6-column format.
Once you have the CSV, you import it into a volume through the admin panel. The demo below simulates what the import process looks like for a small file. Hit Import and watch each row get processed.
📂 CSV import simulator
The CsvImportService in PHP mirrors exactly what the demo shows. It reads the file, validates each row, checks for duplicates, and persists via Doctrine:
```php
// src/Service/CsvImportService.php — validation loop (simplified)
while (($data = fgetcsv($csvHandle, null, ',', '"', '\\')) !== false) {
    $line++;

    // Skip header if present
    if ($line === 1 && strtolower((string) ($data[0] ?? '')) === 'page') {
        continue;
    }

    // Auto-detect 6-col vs 4-col format
    $is6Col  = count($data) >= 6;
    $pageNum = (int) $data[0];
    $text    = $is6Col ? ($data[3] ?? '') : ($data[1] ?? '');
    $reading = $is6Col ? ($data[4] ?? '') : ($data[2] ?? '');

    // Metadata rows have empty word column
    if (trim($text) === '') {
        $this->handleMetadataRow($data, $volume);
        continue;
    }

    // Parentheses around the null-coalesce so missing position columns default to 0
    $rows[] = [$pageNum, $text, $reading, (int) ($data[5] ?? 0), $line];
    $neededTexts[] = $text;
    $neededPageNums[] = $pageNum;
}
```
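The metadata rows (page number, caption, notes, and an empty word column) are what set chapter captions on pages. handleMetadataRow() is only referenced above, so here's a rough reconstruction; the Page setters and the findOrCreatePage() helper are assumptions.

```php
// Sketch of the metadata-row handler; the real implementation in
// CsvImportService and the exact Page API (setCaption / setNotes) may differ.
private function handleMetadataRow(array $data, Volume $volume): void
{
    $pageNum = (int) $data[0];
    $caption = trim((string) ($data[1] ?? ''));   // e.g. "Chapter 1", "Cover"
    $notes   = trim((string) ($data[2] ?? ''));

    $page = $this->findOrCreatePage($volume, $pageNum);

    if ($caption !== '') {
        $page->setCaption($caption);
    }
    if ($notes !== '') {
        $page->setNotes($notes);
    }
}
```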
The Batch-Loading Fix (792920c)
Batch flushing is Doctrine 101, but it’s the kind of thing you only learn after watching a 200-row import take 45 seconds and timing out on Heroku.
The original import worked, but it was slow. Embarrassingly slow. Each row triggered its own findVocabulary() DB query and its own $em->flush(). For a 1,000-row CSV that meant ~2,000 SQL queries. On Heroku, that reliably hit the H12 30-second timeout.
The fix was a 4-pass batch approach:
- Pass 1 — parse the entire CSV into memory (no DB)
- Pass 2 — create all missing pages in one flush
- Pass 3 — batch-load all vocabulary with a single SELECT ... WHERE text IN (...)
- Pass 4 — assemble INSERT tuples and fire one INSERT IGNORE
⏱ Import performance: before vs after
The key change — going from N individual flushes to a single batched INSERT:
```php
// BEFORE — one flush per row
foreach ($rows as $row) {
    $vocab = $this->findVocabulary($row['text'], $row['reading']); // 1 SELECT each
    $this->em->persist($vocab);
    $this->em->flush();              // 1 INSERT each — 1000 rows = 2000 queries
}

// AFTER — batch select + single INSERT
$vocabMap = $this->batchLoadVocabulary($uniqueTexts);              // 1 SELECT
$tuples = [];

foreach ($rows as $row) {
    $vocabId = $vocabMap[$row['text']]?->getId();
    $pageId  = $pages[$row['page']];
    $tuples[] = '(' . $pageId . ',' . $vocabId . ',' . $row['pos'] . ')';
}

// One INSERT IGNORE for the whole file
$this->em->getConnection()->executeStatement(
    'INSERT IGNORE INTO page_vocabulary (page_id, vocabulary_id, position) VALUES '
    . implode(',', $tuples)
);
```
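Pass 3 is the interesting one: batchLoadVocabulary() turns N lookups into a single IN() query keyed by text. The method name comes from the snippet above, but this body (and the indexBy trick) is my sketch of how it could work, not the actual service code.

```php
/**
 * Sketch of the Pass 3 batch loader: one SELECT for every unique text in the file.
 *
 * @return array<string, Vocabulary> map of text => entity
 */
private function batchLoadVocabulary(array $uniqueTexts): array
{
    if ($uniqueTexts === []) {
        return [];
    }

    return $this->em->createQueryBuilder()
        ->select('v')
        ->from(Vocabulary::class, 'v', 'v.text')   // index results by text for O(1) lookups
        ->where('v.text IN (:texts)')
        ->setParameter('texts', $uniqueTexts)
        ->getQuery()
        ->getResult();
}
```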
Vocabulary Meaning Entities (e5970a3)
A word like 大丈夫 has multiple senses: “OK”, “safe”, “certainly”. Collapsing them into a single text column made filtering by part-of-speech impossible.
The original schema had a single meaning TEXT column on both Vocabulary and Kanji. That worked fine until I needed to filter vocabulary by part of speech, or display individual senses in the iOS app. Commit e5970a3 replaced both columns with proper one-to-many relationships — VocabularyMeaning and KanjiMeaning entities with a position field for ordering.
🗂 Entity relationships
A backward-compatible getMeaning() helper on Vocabulary returns the first meaning from the collection, so callers that only need the primary definition didn’t need to change:
```php
// Backward-compatible helper
public function getMeaning(): string
{
    // Collection::first() returns false (not null) when the collection is empty
    $first = $this->meanings->first();

    return $first instanceof VocabularyMeaning ? $first->getDefinition() : '';
}

// New: get all meanings for the iOS detail view
public function getMeanings(): Collection
{
    return $this->meanings->matching(
        Criteria::create()->orderBy(['position' => 'ASC'])
    );
}
```
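For a sense of the new shape, here's roughly what the meaning entity might look like. The source only names the entity, its relation to Vocabulary, and the position field; everything else below is an assumption (KanjiMeaning would follow the same pattern).

```php
// Sketch of the VocabularyMeaning entity; column and property names beyond
// "definition" and "position" are assumptions.
#[ORM\Entity]
class VocabularyMeaning
{
    #[ORM\Id]
    #[ORM\GeneratedValue]
    #[ORM\Column]
    private ?int $id = null;

    #[ORM\ManyToOne(targetEntity: Vocabulary::class, inversedBy: 'meanings')]
    #[ORM\JoinColumn(nullable: false)]
    private Vocabulary $vocabulary;

    #[ORM\Column(type: 'text')]
    private string $definition;

    // Used by getMeanings() to keep senses in dictionary order
    #[ORM\Column(type: 'integer')]
    private int $position = 0;

    public function getDefinition(): string
    {
        return $this->definition;
    }

    public function getPosition(): int
    {
        return $this->position;
    }
}
```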
The Admin Panel
The admin panel is server-rendered Twig with a full dark-mode redesign from b5161b8. No React, no Vue — just Symfony and a healthy respect for HTML forms.
Commit e904c16 added per-page vocabulary management to the admin: view all words on a page, reorder them, add new entries, set captions. The mockup below gives you an interactive feel for how it works. Click any row to expand the edit form.
| # | Word | Meaning | Page |
|---|---|---|---|
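For a feel of the plumbing behind that mockup, here's a minimal sketch of what a reorder action could look like. The route, request shape, and entity accessors are all assumptions rather than the actual controller from e904c16.

```php
// Hypothetical reorder action (inside an AbstractController); illustrative only.
// Assumes a PageVocabulary join entity that carries the position column.
#[Route('/admin/page/{id}/vocabulary/reorder', methods: ['POST'])]
public function reorder(Page $page, Request $request, EntityManagerInterface $em): JsonResponse
{
    // Expected payload: {"order": [pageVocabularyId, pageVocabularyId, ...]}
    $order = $request->toArray()['order'] ?? [];

    foreach ($page->getPageVocabularies() as $pv) {
        $position = array_search($pv->getId(), $order, true);
        if ($position !== false) {
            $pv->setPosition($position);
        }
    }

    $em->flush();

    return $this->json(['status' => 'ok']);
}
```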
Excluding Known Words from Frequency (7324008)
The SQL subquery that excludes known words is deceptively simple, but it’s the feature users noticed most — “it only shows me words I don’t already know!”
The GET /api/vocabulary/common endpoint returns the most frequent vocabulary items across a manga series — but for logged-in users, it should skip words they already know. That’s a one-line SQL addition that makes the whole product feel personalised:
```sql
-- Core query for common vocabulary, unknown to this user
SELECT v.id,
       v.text,
       v.reading,
       COUNT(DISTINCT pv.page_id) AS occurrences
FROM vocabulary v
JOIN page_vocabulary pv ON pv.vocabulary_id = v.id
JOIN page p             ON p.id = pv.page_id
JOIN volume vol         ON vol.id = p.volume_id
WHERE vol.manga_id = :mangaId
  AND v.id NOT IN (
      SELECT kw.vocabulary_id
      FROM known_word kw
      WHERE kw.user_id = :userId   -- the magic line
  )
GROUP BY v.id
ORDER BY occurrences DESC
LIMIT 50;
```
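On the PHP side this reads most naturally as a raw DBAL query rather than DQL, since the subquery and GROUP BY map one-to-one. A sketch of what the repository method might look like; the method name and wiring are assumptions, not the actual endpoint code.

```php
// Sketch only; the real repository/endpoint may be structured differently.
public function findCommonUnknown(int $mangaId, int $userId, int $limit = 50): array
{
    $sql = '
        SELECT v.id, v.text, v.reading, COUNT(DISTINCT pv.page_id) AS occurrences
        FROM vocabulary v
        JOIN page_vocabulary pv ON pv.vocabulary_id = v.id
        JOIN page p ON p.id = pv.page_id
        JOIN volume vol ON vol.id = p.volume_id
        WHERE vol.manga_id = :mangaId
          AND v.id NOT IN (
              SELECT kw.vocabulary_id FROM known_word kw WHERE kw.user_id = :userId
          )
        GROUP BY v.id
        ORDER BY occurrences DESC
        LIMIT ' . (int) $limit;   // LIMIT inlined as an int, not bound

    return $this->getEntityManager()->getConnection()->fetchAllAssociative(
        $sql,
        ['mangaId' => $mangaId, 'userId' => $userId]
    );
}
```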
The interactive chart below shows vocabulary frequency across a mock manga series. Toggle between “Show all” and “Exclude known” to see how the known words disappear.
📊 Word frequency across manga pages
What’s Next
Part 3 is where things got painful in a very instructive way. The N+1 query pattern shows up when you’re not looking, and the fix changes how you think about ORMs.
The vocabulary engine works. You can scrape a series, run a folder of images through Gemini, import the resulting CSV, and have a fully populated database in minutes. The admin panel lets you review and fix anything the AI got wrong.
But as soon as I added real users and real reading sessions, the API started slowing down. Pages that should load in 100ms were taking 2–3 seconds. The culprit was the N+1 query pattern — hiding in places I didn’t expect.
Part 3 is about finding every N+1, fixing them with DQL joins and EXTRA_LAZY fetch modes, and what I learned about Doctrine’s behaviour that I wish someone had told me earlier.
See you in the next one.