meguri

Decide what to crawl next

meguri is the decision layer of a crawl stack. It absorbs an endless stream of discovered links, deduplicates them, schedules them politely, keeps them fresh on a cycle, and serializes the whole frontier to compact .meguri partitions that a fleet redistributes by moving files.

A crawler has three jobs: fetch pages, decide what to fetch next, and store what it found. meguri (巡, "to make the rounds and revisit on a cycle") is the middle one. It never opens a socket. It takes the links a fetcher discovered, decides which are worth crawling and in what order, holds back so a host is never hammered, and revisits pages on a schedule tuned to how often each one actually changes.

The state behind those decisions lives in .meguri files. A partition owns a range of hosts and serializes to exactly one file: a self-describing, columnar, checksummed container holding the crawl state of every URL and every host in that range. The same file serves as the engine's checkpoint, the unit a fleet moves when it redistributes work, and the cold archive.

Say you have a list of URLs from Common Crawl and you want a polite, ordered crawl plan out of it. One command builds a frontier; a second drains it; a third reads back what the partition holds:

ccrawl search '*.example.com/*' --limit 50000 -o jsonl | meguri seed -o frontier.meguri
meguri run -i frontier.meguri -o crawled.meguri
meguri inspect crawled.meguri

What it does

  • One file is a whole partition. A .meguri file holds the per-URL and per-host frontier state for one range of hosts, behind a CRC-checked header and a footer the reader finds from the tail. The tables are columnar and paged, so a tool reads a file's shape in two small reads no matter how large it is.
  • Deterministic to the byte. The same partition value always encodes to the same bytes, so a checkpoint is diffable, a redistribution is verifiable, and a round trip is exact.
  • Polite by construction. A URL's identity is a 128-bit key whose high half is its host, so a host's URLs share a partition, a politeness bucket, and a contiguous range in the file. One in-flight fetch per host and per IP falls out of the layout.
  • Fresh on a cycle. Every fetch outcome updates a per-URL change-rate estimate, so a page that changes hourly is revisited often and one that never changes drifts to the back.
  • Pure Go, one binary. No cgo, no external queue, no database server. CGO_ENABLED=0 builds for every platform.
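The "polite by construction" point rests on the key layout: if the high 64 bits of a URL's key depend only on its host, then sorting by key groups a host's URLs together and any range split of the key space keeps a host inside one partition. The sketch below shows that idea under stated assumptions. The `Key` type, the use of FNV-1a as the hash, and the `Partition` helper are all hypothetical; the page does not specify how meguri derives its keys.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"net/url"
	"strings"
)

// Key is a hypothetical 128-bit URL identity: Host is the high half,
// Rest the low half.
type Key struct {
	Host uint64 // depends only on the hostname
	Rest uint64 // depends on path and query
}

// hash64 is FNV-1a, chosen here only because it is in the standard library.
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// NewKey derives a key from an absolute URL.
func NewKey(raw string) (Key, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return Key{}, err
	}
	host := strings.ToLower(u.Hostname())
	if host == "" {
		return Key{}, fmt.Errorf("no host in %q", raw)
	}
	return Key{Host: hash64(host), Rest: hash64(u.RequestURI())}, nil
}

// Less orders by the host half first, so one host's URLs sort contiguously.
func (k Key) Less(o Key) bool {
	if k.Host != o.Host {
		return k.Host < o.Host
	}
	return k.Rest < o.Rest
}

// Partition splits the 64-bit host space into n equal ranges (n > 0) and
// returns the index of the range holding this key: floor(Host * n / 2^64).
func (k Key) Partition(n uint64) uint64 {
	hi, _ := bits.Mul64(k.Host, n)
	return hi
}

func main() {
	a, _ := NewKey("https://example.com/a")
	b, _ := NewKey("https://EXAMPLE.com/b?x=1")
	fmt.Println(a.Host == b.Host)                   // true: same host, same high half
	fmt.Println(a.Partition(16) == b.Partition(16)) // true: same partition
}
```

Because the partition index is a function of the host half alone, a per-host politeness bucket never has to coordinate across partitions in this scheme.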

Where it fits

meguri is the scheduler between a fetcher and a store. A crawler like ami fetches the URLs meguri dispatches and returns outcomes; a corpus like Common Crawl through ccrawl-cli seeds the frontier and stands in for the live web in tests; a store and ranker like tsumugi consumes the crawled pages.

Where to go next

  • New here? Start with the introduction, then the quick start.
  • Want to install it? See installation.
  • Looking for a specific task? The guides cover seeding a frontier, running a crawl loop, serving a durable partition, rebalancing the files, and projecting the cost to fleet scale.
  • Curious what is inside a partition? The file format page documents the .meguri container, the CLI reference covers every command, and configuration lists the environment and on-disk layout.