
DynaPDF Content Parser

PDF pages store their content in content streams, and you can inspect those streams with our DynaPDFParserMBS class. You open a PDF document, import the pages into memory and then parse a page. Once you have parsed the page, you can access its content objects. That's great for a few things:

  • Extract text or vector graphics
  • Remove unwanted elements
  • Modify drawings
  • Get bounding boxes and coordinates for every item
  • Check which font is active for which text fragment

Many properties in these classes are settable, so you can, for example, change a color, adjust a coordinate in a vector graphic or change the line width.

Or when you want to place a template on top of an existing page, you may need to modify the content to remove a rectangle in the background, so you can see through the template to the content behind it.

Here is a sample that marks all images for deletion and then writes the page back.

Example
// Remove all images from one page.
// Assumes parser (DynaPDFParserMBS) and page (Integer) are already set up.
Var needWrite As Boolean
Var ContentParsingFlags As Integer = 0

If parser.ParsePage(page, ContentParsingFlags) Then

  Var u As Integer = parser.OperatorCount - 1
  For j As Integer = 0 To u
    Var content As DynaPDFParserContentMBS = parser.Content(j)

    If content.Operator = DynaPDFParserContentMBS.kopDrawImage Then
      // mark this operator for deletion
      content.Delete
      needWrite = True
    End If
  Next

  // write the page back only if something changed
  If needWrite Then
    Call parser.WriteToPage
  End If
End If
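
The same loop can be wrapped so it works for any operator, not only images. The following is a minimal sketch: RemoveOperators is a hypothetical helper name and not part of the plugin, and it assumes the Operator property can be compared to an Integer constant, as the sample above suggests. It uses only the calls already shown (ParsePage, OperatorCount, Content, Delete and WriteToPage).

```xojo
// Hypothetical helper, not part of the plugin: marks every operator of one
// kind for deletion and returns how many were marked.
Function RemoveOperators(parser As DynaPDFParserMBS, page As Integer, op As Integer) As Integer
  Var removed As Integer

  If parser.ParsePage(page, 0) Then
    Var u As Integer = parser.OperatorCount - 1
    For j As Integer = 0 To u
      Var content As DynaPDFParserContentMBS = parser.Content(j)
      If content.Operator = op Then
        content.Delete
        removed = removed + 1
      End If
    Next

    // Only write the page back if something changed.
    If removed > 0 Then
      Call parser.WriteToPage
    End If
  End If

  Return removed
End Function
```

You could then call it, for example, as RemoveOperators(parser, page, DynaPDFParserContentMBS.kopDrawInlineImage) to drop inline images as well.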

Here is the list of operators and the class used for each one. By default you get a DynaPDFParserContentMBS object. If the operator has parameters, you get the matching subclass, so you can access its properties.

Constant | Value | Description | Class
kopBeginCompatibility | 1 | Ignore unknown operators until the section is terminated with kopEndCompatibility. | no parameters
kopBeginMarkedContent | 2 | Begins marked content. | DynaPDFParserContentBeginMarkedContentMBS
kopBeginText | 3 | Begins text. | no parameters
kopClipPath | 4 | Clip current path. | DynaPDFParserContentClipPathMBS
kopClipPathExt | 5 | Clip path with extended options. | DynaPDFParserContentClipPathExtMBS
kopDrawImage | 6 | Draw an image. | DynaPDFParserContentDrawImageMBS
kopDrawInlineImage | 7 | Draw an inline image. | DynaPDFParserContentDrawInlineImageMBS
kopDrawPath | 8 | Draw a path. | DynaPDFParserContentDrawPathMBS
kopDrawPathExt | 9 | Draw a path with more options. | DynaPDFParserContentDrawPathExtMBS
kopDrawShading | 10 | Draw shading. | DynaPDFParserContentDrawShadingMBS
kopDrawTemplate | 11 | Draw a template. | DynaPDFParserContentDrawTemplateMBS
kopDrawTranspGroup | 12 | Draw a transparency group. | DynaPDFParserContentDrawGroupMBS
kopEndCompatibility | 13 | Compatibility section ends. | no parameters
kopEndMarkedContent | 14 | End marked content. | no parameters
kopEndText | 15 | End text. | no parameters
kopInitType3Glyph0 | 16 | Init Type 3 glyph. | DynaPDFParserContentInitType3GlyphMBS
kopInitType3Glyph1 | 17 | Init Type 3 glyph. | DynaPDFParserContentInitType3GlyphMBS
kopInsertPostscript | 18 | Insert PostScript. Can be considered when printing on a PostScript device. | DynaPDFParserContentInsertPostscriptMBS
kopMarkedContPoint | 19 | Marked content point. | DynaPDFParserContentMarkedContPntMBS
kopMulMatrix | 20 | Multiply matrix. | DynaPDFParserContentMulMatrixMBS
kopNull | 0 | This represents a deleted node. | none
kopPageHeader | 21 | Page header. | DynaPDFParserContentPageHeaderMBS
kopRestoreGS | 22 | Restore graphics state. | no parameters
kopSaveGS | 23 | Save graphics state. | no parameters
kopSetCharSpacing | 24 | Set character spacing. | DynaPDFParserContentFloatMBS
kopSetExtGState | 25 | Set extended graphics state. | DynaPDFParserContentExtGStateMBS
kopSetFillColor | 26 | Set fill color. | DynaPDFParserContentColorMBS
kopSetFillColorSpace | 27 | Set fill color space. | DynaPDFParserContentColorSpaceMBS
kopSetFillPattern | 28 | Set fill pattern. | DynaPDFParserContentPatternMBS
kopSetFlatnessTolerance | 29 | Set flatness tolerance. | DynaPDFParserContentFloatMBS
kopSetFont | 30 | Set font. | DynaPDFParserContentFontMBS
kopSetLineCapStyle | 31 | Set line cap style. | DynaPDFParserContentIntMBS
kopSetLineDashPattern | 32 | Set line dash pattern. | DynaPDFParserContentLineDashPatternMBS
kopSetLineJoinStyle | 33 | Set line join style. | DynaPDFParserContentIntMBS
kopSetLineWidth | 34 | Set line width. | DynaPDFParserContentFloatMBS
kopSetMiterLimit | 35 | Set miter limit. | DynaPDFParserContentFloatMBS
kopSetRenderingIntent | 36 | Set rendering intent. | DynaPDFParserContentIntMBS
kopSetStrokeColor | 37 | Set stroke color. | DynaPDFParserContentColorMBS
kopSetStrokeColorSpace | 38 | Set stroke color space. | DynaPDFParserContentColorSpaceMBS
kopSetStrokePattern | 39 | Set stroke pattern. | DynaPDFParserContentPatternMBS
kopSetTextDrawMode | 40 | Set text drawing mode. | DynaPDFParserContentIntMBS
kopSetTextScale | 41 | Set text scale. | DynaPDFParserContentFloatMBS
kopSetWordSpacing | 42 | Set word spacing. | DynaPDFParserContentFloatMBS
kopShowText | 43 | Shows text. | DynaPDFParserContentShowTextMBS
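
To get a feeling for what a page contains, you can count the operators by kind before changing anything. This is a rough sketch that relies only on the Operator property and the constants listed above. It assumes parser and page are set up as in the first sample. The counter names and the grouping into text, images and paths are just choices made for illustration.

```xojo
// Sketch: count the operators on one page by kind.
// Assumes parser (DynaPDFParserMBS) and page (Integer) already exist.
Var textCount, imageCount, pathCount, otherCount As Integer

If parser.ParsePage(page, 0) Then
  Var u As Integer = parser.OperatorCount - 1
  For j As Integer = 0 To u
    Var content As DynaPDFParserContentMBS = parser.Content(j)

    Select Case content.Operator
    Case DynaPDFParserContentMBS.kopShowText
      textCount = textCount + 1
    Case DynaPDFParserContentMBS.kopDrawImage, DynaPDFParserContentMBS.kopDrawInlineImage
      imageCount = imageCount + 1
    Case DynaPDFParserContentMBS.kopDrawPath, DynaPDFParserContentMBS.kopDrawPathExt
      pathCount = pathCount + 1
    Else
      // everything else: state changes, markers, deleted nodes, ...
      otherCount = otherCount + 1
    End Select
  Next

  System.DebugLog "Text: " + textCount.ToString + ", images: " + imageCount.ToString + _
    ", paths: " + pathCount.ToString + ", other: " + otherCount.ToString
End If
```

Nothing is modified here, so there is no need to call WriteToPage afterwards.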

Here is a sample where we check whether a content object is a DynaPDFParserContentDrawImageMBS object, so we can assign it to such a variable and access properties:

Example
Var pdf As New DynaPDFMBS
// ... load some PDF

Var parser As New DynaPDFParserMBS(pdf)
Var ContentParsingFlags As Integer = 0
Var page As Integer = 0

If parser.ParsePage(page, ContentParsingFlags) Then

  Var u As Integer = parser.OperatorCount - 1
  For j As Integer = 0 To u
    Var content As Variant = parser.Content(j)
    If content IsA DynaPDFParserContentDrawImageMBS Then
      Var ContentDrawImage As DynaPDFParserContentDrawImageMBS = content

      MessageBox "Image " + ContentDrawImage.ImageHandle.ToString + " is on the page"
    End If
  Next
End If

Please try this. You may enjoy walking through all the content of the pages in your PDF documents and making interesting adjustments.

07 05 26 - 07:08