Using DynaPDF parser to find characters

Please check out the DynaPDFParserMBS class in the MBS Xojo DynaPDF Plugin. This class allows you to:

Parse a page

Extract text

Find text

Replace text

Find characters

Delete text

Write changes back to page

You can limit the search to part of the page or search the whole page, and you can set options such as whether the text search is case insensitive.

Today we want to show you how to identify the exact position of any character in a PDF. In this picture, every character is outlined with a box, even in mirrored or rotated text:

Let us show the code for this. You may review the example project Text Positions with parser and see where we load the PDF. Once it is loaded, we initialize the DynaPDFParserMBS object. We use kstMatchAlways here so that the parser does not look for a particular text, but reports the position of every character:

// report the position of every character on page 1
Dim parser As New DynaPDFParserMBS(p)
Dim area As DynaPDFRectMBS = Nil // whole page
Dim SearchType As Integer = DynaPDFParserMBS.kstMatchAlways
Dim ContentParsingFlags As Integer = DynaPDFParserMBS.kcpfEnableTextSelection

If parser.ParsePage(1, ContentParsingFlags) Then
  Dim index As Integer = 0
  Dim found As Boolean = parser.FindText(area, SearchType, "")
  
  While found
    Dim r As DynaPDFRectMBS = parser.SelBBox
    Dim t As New PDFText
    t.Text = parser.SelText
    t.rect = r
    t.index = index
    t.points = parser.SelBBox2 // corner points, also valid for rotated text
    texts.Append t
    
    index = index + 1
    // pass True to continue the search after the previous match
    found = parser.FindText(area, SearchType, "", True)
  Wend
End If

The loop runs as long as more text is found. For each character, we get the selected text and the bounding box as an array of points. You can of course just get the rectangle, but that won't handle rotated text. We continue the loop by calling FindText again, passing True to continue the search.

In the paint event of the window, we draw the PDF page first. Then we loop over the found text pieces and show each character surrounded with the box drawn from the points we got:

For Each t As PDFText In texts
  Dim points() As DynaPDFPointMBS = t.points
  g.ForeColor = &c00FF00
  g.DrawLine points(0).X * factor, points(0).Y * factor, points(1).X * factor, points(1).Y * factor
  g.DrawLine points(1).X * factor, points(1).Y * factor, points(2).X * factor, points(2).Y * factor
  g.DrawLine points(2).X * factor, points(2).Y * factor, points(3).X * factor, points(3).Y * factor
  g.DrawLine points(3).X * factor, points(3).Y * factor, points(0).X * factor, points(0).Y * factor
Next
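The four DrawLine calls can also be written as a loop over the corner points. The following is only a sketch: DrawQuad is a hypothetical helper name that is not part of the plugin or the example project, and it assumes the points arrive in outline order, as the code above does.

```xojo
// Hypothetical helper: draws a closed outline through the given corner points.
// Assumes the points are in outline order.
Sub DrawQuad(g As Graphics, points() As DynaPDFPointMBS, factor As Double)
  Dim count As Integer = points.Ubound + 1
  
  For i As Integer = 0 To count - 1
    Dim a As DynaPDFPointMBS = points(i)
    // Mod wraps the last corner back to the first one
    Dim b As DynaPDFPointMBS = points((i + 1) Mod count)
    g.DrawLine a.X * factor, a.Y * factor, b.X * factor, b.Y * factor
  Next
End Sub

// Usage in the paint event, after drawing the PDF page:
g.ForeColor = &c00FF00
For Each t As PDFText In texts
  DrawQuad g, t.points, factor
Next
```

Setting the color once before the loop is enough, since it does not change between characters.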

As shown, you can determine the position of each character. You may use the DeleteText function to precisely cut text and remove individual characters from the PDF page. Or you can annotate the PDF page. For example, you could add WebLinks to specific words once you know the surrounding rectangle.
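To get the surrounding rectangle of a word, you can combine the corner points of its characters. The sketch below assumes the texts array and the PDFText class from the example project; WordBounds and its parameter names are hypothetical and only meant to illustrate the idea. The result is in the same coordinate space as the points reported by the parser, so you may need to convert it before passing it to an annotation function.

```xojo
// Hypothetical helper: smallest axis-aligned rectangle that encloses
// the characters startIndex..endIndex. Returns False if the range is invalid
// or contains no points.
Function WordBounds(texts() As PDFText, startIndex As Integer, endIndex As Integer, ByRef minX As Double, ByRef minY As Double, ByRef maxX As Double, ByRef maxY As Double) As Boolean
  If startIndex < 0 Or endIndex > texts.Ubound Or startIndex > endIndex Then
    Return False
  End If
  
  Dim started As Boolean = False
  
  For i As Integer = startIndex To endIndex
    Dim pts() As DynaPDFPointMBS = texts(i).points
    
    For Each p As DynaPDFPointMBS In pts
      If Not started Then
        // the first point initializes the rectangle
        minX = p.X
        maxX = p.X
        minY = p.Y
        maxY = p.Y
        started = True
      Else
        // every further point can only grow the rectangle
        minX = Min(minX, p.X)
        maxX = Max(maxX, p.X)
        minY = Min(minY, p.Y)
        maxY = Max(maxY, p.Y)
      End If
    Next
  Next
  
  Return started
End Function
```

The width and height of the rectangle are then maxX - minX and maxY - minY. For rotated text this rectangle is larger than the text itself, which is usually fine for a clickable area.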

Please try the example project and let us know what questions you have. The SelBBox2 and SelText properties were added in v24.1 in response to customer requests.