Understanding PDF Structure and Content Streams
PDF files look simple on the surface, but internally they are highly structured documents built from objects, streams, and drawing instructions. If you're working with tools like DynaPDF's parser functions, understanding how PDFs are organized is essential.
1. The High-Level Structure of a PDF
A PDF file consists of four main parts:
- Header – Defines the PDF version (e.g., %PDF-1.7)
- Body – Contains all objects (pages, fonts, images, etc.)
- Cross-reference table (xref) – Maps each object number to its byte offset in the file
- Trailer – Points to the root object and metadata
Document content is stored as objects. Objects that are referenced from elsewhere in the file (indirect objects) are identified by an object number and a generation number.
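To make the four parts concrete, here is a minimal Python sketch that assembles a tiny PDF skeleton and then locates the header, cross-reference table, and trailer again. The helper names build_sample_pdf and describe_layout are made up for this illustration, and the regex scan is a simplification: it assumes a classic xref table and will not handle files that use cross-reference streams or incremental updates.

```python
import re

def build_sample_pdf() -> bytes:
    """Assemble a tiny PDF skeleton with no pages (illustration only)."""
    header = b"%PDF-1.7\n"
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    body = b""
    xref = b"xref\n0 3\n0000000000 65535 f \n"
    for obj in objects:
        # Each xref entry records the byte offset where the object starts.
        xref += b"%010d 00000 n \n" % (len(header) + len(body))
        body += obj
    xref_offset = len(header) + len(body)
    trailer = b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    tail = b"startxref\n%d\n" % xref_offset + b"%%EOF\n"
    return header + body + xref + trailer + tail

def describe_layout(data: bytes) -> dict:
    """Locate header, xref table, and trailer root with simple regex scans."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    startxref = re.search(rb"startxref\s+(\d+)\s+%%EOF", data)
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data)
    xref_offset = int(startxref.group(1))
    return {
        "version": header.group(1).decode(),
        "xref_offset": xref_offset,
        # The offset after startxref should point at the "xref" keyword.
        "xref_found": data[xref_offset:xref_offset + 4] == b"xref",
        "root": (int(root.group(1)), int(root.group(2))),
    }

print(describe_layout(build_sample_pdf()))
```

A reader typically works in this order too: read the end of the file to find startxref, jump to the cross-reference table, then follow /Root from the trailer into the body.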
2. Objects in a PDF
Objects are the building blocks of a PDF. Common object types include:
- Dictionaries
- Arrays
- Strings
- Names (e.g., /Type)
- Numbers
- Booleans and null
- Streams
For example, a page itself is just a dictionary object referencing other objects:
<< /Type /Page /Parent 2 0 R /Contents 5 0 R /Resources 6 0 R >>
The important part here is /Contents. It references the object (here 5 0 R, meaning object 5, generation 0) that holds the actual drawing instructions.
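As a rough illustration of how those references can be read, the following Python sketch pulls every "key N G R" pair out of the page dictionary shown above. The function name find_references is hypothetical, and a regex is only adequate for flat dictionaries like this one; a real parser also has to handle nested dictionaries and the case where /Contents is an array of streams.

```python
import re

PAGE_DICT = "<< /Type /Page /Parent 2 0 R /Contents 5 0 R /Resources 6 0 R >>"

def find_references(dictionary: str) -> dict:
    """Map each key to the (object number, generation) it references."""
    pattern = re.compile(r"/(\w+)\s+(\d+)\s+(\d+)\s+R")
    return {
        key: (int(num), int(gen))
        for key, num, gen in pattern.findall(dictionary)
    }

refs = find_references(PAGE_DICT)
print(refs["Contents"])  # object number and generation of the content stream
```

/Type /Page is skipped because its value is a name, not an indirect reference.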
3. What is a Content Stream?
A content stream is a stream object that contains the instructions describing how to render a page. These instructions are written in a compact postfix syntax derived from PostScript.
A content stream looks like this internally:
5 0 obj
<< /Length 17 >>
stream
0 0 m 100 100 l S
endstream
endobj
This example draws a line from (0,0) to (100,100).
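The /Length entry has to match the number of bytes between the stream and endstream keywords, which is easy to get wrong when writing a stream by hand. A small Python sketch of wrapping drawing instructions into an uncompressed stream object might look like this; make_stream_object is a made-up helper name, and real files usually add a /Filter entry for compression.

```python
def make_stream_object(obj_num: int, data: bytes, gen: int = 0) -> bytes:
    """Wrap raw content-stream bytes in an indirect stream object."""
    head = b"%d %d obj\n<< /Length %d >>\nstream\n" % (obj_num, gen, len(data))
    # The line break before "endstream" is not counted in /Length.
    return head + data + b"\nendstream\nendobj\n"

instructions = b"0 0 m 100 100 l S"
print(make_stream_object(5, instructions).decode("ascii"))
```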
4. Operators Inside Content Streams
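Content streams consist of operators and their operands. The operands are written first, followed by the operator that consumes them. Common operators include: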
- m → MoveTo
- l → LineTo
- c → CurveTo
- re → Rectangle
- S → Stroke path
- f → Fill path
- Tj → Show text
Each operator modifies the drawing state or produces visible output.
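The operand-then-operator pattern can be sketched in a few lines of Python. This toy tokenizer is an assumption-laden simplification: it only understands numeric operands separated by whitespace, so string operands (as used by Tj), names, arrays, and inline images are out of scope. The name parse_instructions is invented for the example.

```python
OPERATOR_NAMES = {
    "m": "MoveTo", "l": "LineTo", "c": "CurveTo", "re": "Rectangle",
    "S": "Stroke path", "f": "Fill path", "Tj": "Show text",
}

def parse_instructions(stream: str) -> list:
    """Group numeric operands with the operator that follows them."""
    instructions = []
    operands = []
    for token in stream.split():
        try:
            operands.append(float(token))  # numbers are operands
        except ValueError:
            # Anything else is treated as an operator that consumes
            # the operands collected so far.
            instructions.append((OPERATOR_NAMES.get(token, token), operands))
            operands = []
    return instructions

for name, args in parse_instructions("0 0 m 100 100 l S"):
    print(name, args)
```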
Content streams are the heart of a PDF page. Everything visible—text, shapes, images—comes from these instructions.
By analyzing them, you can:
- Extract text or vector graphics
- Remove unwanted elements
- Modify drawings
- Rebuild page layouts
5. How DynaPDF Represents Content
When using DynaPDF.Parser.Content, these low-level instructions are converted into structured JSON. This makes it far easier to analyze or modify a page programmatically.
For example, a simple path might become:
{
"Operator": "DrawPath",
"OPNames": ["MoveTo", "LineTo"],
"Vertices": [
{ "x": 0, "y": 0 },
{ "x": 165, "y": 0.5 }
],
"Mode": 1,
...
}
Instead of parsing raw PDF syntax, you now work with clean data:
- Operator – High-level command
- Vertices – Geometry points
- Mode – Stroke/fill behavior
- OPNames – Underlying PDF operators
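Once the content is available as JSON, ordinary data handling takes over. The Python sketch below assumes the JSON is an array of entries shaped like the example above; that array layout, the placeholder second entry, and the function name path_bounding_boxes are assumptions made for illustration and are not taken from the DynaPDF documentation.

```python
import json

# Sample data modeled on the entry shown above. "PlaceholderOp" is a
# made-up operator name standing in for any non-path entry.
SAMPLE_JSON = """
[
  {"Operator": "DrawPath", "OPNames": ["MoveTo", "LineTo"],
   "Vertices": [{"x": 0, "y": 0}, {"x": 165, "y": 0.5}], "Mode": 1},
  {"Operator": "PlaceholderOp"}
]
"""

def path_bounding_boxes(entries: list) -> list:
    """Return (min_x, min_y, max_x, max_y) for every DrawPath entry."""
    boxes = []
    for entry in entries:
        if entry.get("Operator") != "DrawPath":
            continue
        xs = [v["x"] for v in entry.get("Vertices", [])]
        ys = [v["y"] for v in entry.get("Vertices", [])]
        if xs:
            boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

print(path_bounding_boxes(json.loads(SAMPLE_JSON)))
```

A bounding box like this is one way to decide which paths to keep or delete, for example when removing a decorative rule at a known position.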
6. Editing Content Streams
With DynaPDF, the workflow typically looks like this:
- Parse the page with DynaPDF.Parser.ParsePage.
- Retrieve the JSON with DynaPDF.Parser.Content, optionally filtering by operator (e.g., "DrawPath").
- Mark entries for deletion with DynaPDF.Parser.Delete.
- Search and replace text with DynaPDF.Parser.FindText and DynaPDF.Parser.ReplaceSelText.
- Write the changes back to the page with DynaPDF.Parser.WriteToPage.
This allows precise control over individual drawing commands instead of rewriting the entire document.
7. Mental Model: How a PDF Page is Rendered
Think of a PDF page like a script executed step-by-step:
- Set graphics state (color, line width, font)
- Define paths (MoveTo, LineTo, etc.)
- Draw them (stroke/fill)
- Render text
- Place images
Each instruction builds on the previous state, which is why order matters.
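A toy interpreter shows why order matters. The Python sketch below tracks a single piece of graphics state, the line width, which is set by the w operator (a standard PDF operator not listed earlier in this article) and defaults to 1. Each stroke records whatever width is current at that moment. The function name render and the simplified token handling are assumptions for this example only.

```python
def render(stream: str) -> list:
    """Execute a tiny subset of operators (w, m, l, S) in order."""
    line_width = 1.0   # default line width in PDF
    path = []          # path under construction
    operands = []
    strokes = []       # what would end up on the page
    for token in stream.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass
        if token == "w":
            line_width = operands[0]           # changes state only
        elif token in ("m", "l"):
            path.append((token, operands[0], operands[1]))
        elif token == "S":
            # Stroking uses the state as it is right now.
            strokes.append({"width": line_width, "path": path})
            path = []
        operands = []
    return strokes

early = render("5 w 0 0 m 10 0 l S")
late = render("0 0 m 10 0 l S 5 w")
print(early[0]["width"], late[0]["width"])
```

The two streams contain the same operators, but in the second one the width change arrives after the stroke and therefore has no visible effect.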
Conclusion
A PDF is not just a static document—it’s a sequence of drawing commands stored in structured objects. The content stream is where the real action happens, and tools like DynaPDF expose this layer in a developer-friendly way.
Once you understand content streams, manipulating PDFs becomes far more predictable and powerful.