PDF technical note

PDF internals without the mystery.

A practical explanation of XREF tables, object maps, streams, and repair logic for people who need PDFs to open, merge, split, and survive export.

7 min read · Repair logic · PDF structure

What the XREF table does

The XREF (cross-reference) table is the map of a PDF. It tells a reader the byte offset at which each object starts in the file: pages, fonts, images, annotations, metadata, and document structure.

When those byte offsets are wrong, the PDF may open slowly, display missing pages, fail to merge, or show a damaged-file warning.

Anatomy of an entry

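A classic XREF entry stores a 10-digit byte offset, a 5-digit generation number, and a flag marking the object as in use (n) or free (f). Entries are grouped into subsections, and each subsection begins with a header line giving the first object number and the number of entries that follow. In the example below, "14 2" introduces the entries for objects 14 and 15.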

xref
0 1
0000000000 65535 f
14 2
0000000014 00000 n
0000000088 00000 n
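The entry layout above can be read with a few lines of Python. This is a minimal sketch rather than a full parser: the function name parse_xref_section is hypothetical, and it assumes the section has already been located and decoded to text.

```python
import re

def parse_xref_section(text: str) -> dict[int, tuple[int, int, bool]]:
    """Map object number -> (offset, generation, in_use) for a classic xref section."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("section must start with the 'xref' keyword")
    entries: dict[int, tuple[int, int, bool]] = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number, then entry count.
        header = re.fullmatch(r"(\d+) (\d+)", lines[i])
        if header is None:
            break  # anything else (such as 'trailer') ends the section
        first, count = int(header.group(1)), int(header.group(2))
        i += 1
        for obj_num in range(first, first + count):
            if i >= len(lines):
                raise ValueError("subsection is shorter than its header claims")
            entry = re.fullmatch(r"(\d{10}) (\d{5}) ([nf])", lines[i])
            if entry is None:
                raise ValueError(f"malformed entry: {lines[i]!r}")
            entries[obj_num] = (
                int(entry.group(1)),    # byte offset (next free object number for 'f')
                int(entry.group(2)),    # generation number
                entry.group(3) == "n",  # True when the object is in use
            )
            i += 1
    return entries

sample = """xref
0 1
0000000000 65535 f
14 2
0000000014 00000 n
0000000088 00000 n
"""
table = parse_xref_section(sample)
```

For the sample above, the result has three keys: object 0 (free) and objects 14 and 15 (in use).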

Why repair works

A repair workflow scans the raw bytes of the file, finds object headers of the form "12 0 obj", records the offset of each one, and writes a fresh cross-reference section so readers can locate objects again.
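The scan-and-rebuild idea can be sketched in Python as follows. The function names scan_objects and build_xref_table are hypothetical, and the sketch makes simplifying assumptions: it ignores objects stored inside object streams, it may be fooled by byte sequences inside stream data that happen to look like object headers, and it marks missing object numbers as free without building a proper free list.

```python
import re

# Matches headers such as b"12 0 obj" anywhere in the file.
OBJ_HEADER = re.compile(rb"(\d+)[ \t\r\n]+(\d+)[ \t\r\n]+obj\b")

def scan_objects(data: bytes) -> dict[int, tuple[int, int]]:
    """Return {object_number: (offset, generation)}; a later definition wins."""
    found: dict[int, tuple[int, int]] = {}
    for match in OBJ_HEADER.finditer(data):
        found[int(match.group(1))] = (match.start(), int(match.group(2)))
    return found

def build_xref_table(found: dict[int, tuple[int, int]]) -> bytes:
    """Write one classic xref subsection covering objects 0..max."""
    size = max(found, default=0) + 1
    lines = [b"xref", b"0 %d" % size]
    for num in range(size):
        if num in found:
            offset, gen = found[num]
            lines.append(b"%010d %05d n " % (offset, gen))
        else:
            # Simplification: object 0 heads the free list; other gaps are just marked free.
            lines.append(b"%010d %05d f " % (0, 65535 if num == 0 else 0))
    # Each entry is 18 characters plus a two-byte end-of-line (space + newline).
    return b"\n".join(lines) + b"\n"
```

A complete repair would also write a new trailer and a startxref value pointing at the rebuilt section; that step is left out here to keep the sketch short.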

Why PDFs fail

  • Interrupted downloads: The file ends before the final table is complete.
  • Bad incremental saves: New objects are appended, but the startxref pointer at the end of the file points to the wrong offset.
  • Broken generators: Some scanners and export tools write non-standard structures.
  • Damaged streams: Compressed object data exists, but cannot be decoded cleanly.
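The first two failure modes above can often be detected by looking at the end of the file. The following is a rough diagnostic sketch, not a validator: the function name check_tail, the returned labels, and the 1024-byte search window are all assumptions chosen for illustration.

```python
import re

def check_tail(data: bytes) -> str:
    """Classify the end of a PDF as 'ok', 'truncated', 'no-startxref', or 'bad-pointer'."""
    tail = data[-1024:]  # assumed search window near the end of the file
    if b"%%EOF" not in tail:
        return "truncated"  # e.g. an interrupted download
    pointers = re.findall(rb"startxref\s+(\d+)", tail)
    if not pointers:
        return "no-startxref"
    offset = int(pointers[-1])  # the last pointer is the one readers follow
    if offset >= len(data):
        return "bad-pointer"
    if data[offset:offset + 4] == b"xref":
        return "ok"  # classic cross-reference table
    if re.match(rb"\d+\s+\d+\s+obj", data[offset:offset + 32]):
        return "ok"  # object header, as used by a cross-reference stream
    return "bad-pointer"
```

A result other than 'ok' suggests the file needs the scan-and-rebuild style of repair rather than a normal load.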

XREF streams

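Modern PDFs (version 1.5 and later) often store cross-reference data in compressed XREF streams instead of plain-text tables. This saves space but makes manual inspection harder, because the entries are binary fields rather than readable lines. A repair engine needs to understand both classic tables and stream-based cross-references.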
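To show what stream-based entries look like once decompressed, here is a small Python sketch. The function name decode_xref_stream is hypothetical. It assumes the stream uses plain Flate compression with no predictor, that the three field widths come from the stream dictionary's /W array, and that entries are numbered consecutively from a single starting object number.

```python
import zlib

KINDS = {0: "free", 1: "in_use", 2: "compressed"}

def decode_xref_stream(raw: bytes, widths: tuple[int, int, int], first: int = 0) -> dict[int, tuple[str, int, int]]:
    """Return {object_number: (kind, field2, field3)} from a Flate-compressed xref stream."""
    data = zlib.decompress(raw)
    w1, w2, w3 = widths
    row = w1 + w2 + w3
    entries: dict[int, tuple[str, int, int]] = {}
    for i in range(len(data) // row):
        chunk = data[i * row:(i + 1) * row]
        # A zero-width type field means every entry is type 1 (in use).
        kind_code = int.from_bytes(chunk[:w1], "big") if w1 else 1
        field2 = int.from_bytes(chunk[w1:w1 + w2], "big")
        field3 = int.from_bytes(chunk[w1 + w2:], "big")
        entries[first + i] = (KINDS.get(kind_code, "unknown"), field2, field3)
    return entries
```

For an in_use entry the two fields are the byte offset and generation number, as in a classic table. For a compressed entry they are the number of the object stream that holds the object and the object's index inside that stream, which is why a scan of top-level object headers alone cannot find such objects.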

What to check before export

  1. Open the repaired file in a normal PDF reader.
  2. Check the first, middle, and last pages.
  3. Confirm bookmarks and annotations still behave as expected.
  4. Keep the original file until the repaired output is verified.