
11 Working with Words
This chapter explains how to use the Acrobat core API to search for words, extract and display words, and
highlight words. Using the Acrobat core API, you can, for example, create application logic that extracts
words from a PDF document and places each word in a repository.
This chapter contains the following information.
About searching for words
The Acrobat core API provides typedefs and methods for working with words. Two primary typedefs that
you will use when working with words located in a PDF document are PDWord and PDWordFinder. The
following are two word-finding indicators:
● Presence of non-alphanumeric characters such as dashes.
● Offsets between characters. (While character offsets are well-defined quantities in a PDF file, word
numbers are calculated by the Acrobat or Adobe Reader word finder algorithm.)
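The two indicators above can be illustrated with a small standalone sketch. It is a toy model only, not the Acrobat or Adobe Reader word finder algorithm: the ToyGlyph type, the ToyCountWords function, and the max_gap threshold are all hypothetical names introduced here, and the sketch assumes glyphs on a single line in left-to-right order.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for a positioned glyph; not an Acrobat type. */
typedef struct {
    char   ch;     /* character code */
    double left;   /* x-coordinate of the glyph's left edge */
    double right;  /* x-coordinate of the glyph's right edge */
} ToyGlyph;

/*
 * Counts words in a run of glyphs on one line. A word ends at a
 * non-alphanumeric character, or when the horizontal offset from the
 * previous glyph exceeds max_gap (an assumed, caller-chosen threshold).
 */
static size_t ToyCountWords(const ToyGlyph *glyphs, size_t n, double max_gap)
{
    size_t words = 0;
    int in_word = 0;

    for (size_t i = 0; i < n; i++) {
        int alnum = isalnum((unsigned char)glyphs[i].ch) != 0;
        int gap = i > 0 && (glyphs[i].left - glyphs[i - 1].right) > max_gap;

        if (gap)
            in_word = 0;      /* large offset: the previous word has ended */

        if (!alnum) {
            in_word = 0;      /* a dash, space, and so on ends the word */
        } else if (!in_word) {
            words++;          /* first alphanumeric glyph of a new word */
            in_word = 1;
        }
    }
    return words;
}
```

In this model, the glyphs of "ab-cd" placed edge to edge count as two words because of the dash, and "abcd" with a wide offset between "b" and "c" also counts as two words.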
About PDWord typedefs
A PDWord object represents a word in a PDF file. Each word contains a sequence of characters in one or
more styles. The characters in a word are not necessarily physically adjacent. For example, a word can be
hyphenated across a line break on a page.
Each character in a word has a character type. Character types include: control code, lowercase letter,
uppercase letter, digit, punctuation mark, hyphen, soft hyphen, ligature, white space, comma, period,
unmapped glyph, end-of-phrase glyph, wildcard, word break, and glyphs that cannot be represented in
the destination font encoding. For information about character types, see the Acrobat and PDF Library
API Reference.
The PDWordGetCharacterTypes method gets the character type for each character in a word. The
PDWordGetAttr method returns a mask containing information on the types of characters in a word. The
mask is the bitwise OR of several flags, including the following:
● One or more characters in the word cannot be represented in the output encoding.
● One or more characters in the word are punctuation marks.
● The first character in the word is a punctuation mark.
● The last character in the word is a punctuation mark.
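The following standalone sketch shows how a mask of this kind can be built and tested. It does not call the Acrobat core API: the TOY_ flag names and values and the ToyWordAttr function are hypothetical, and the sketch assumes a 7-bit ASCII output encoding, so any byte above 0x7F counts as unrepresentable. The actual flag constants are listed in the Acrobat and PDF Library API Reference.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag names and values, for illustration only. */
enum {
    TOY_HAS_UNMAPPED_CHAR = 1 << 0, /* a character cannot be represented */
    TOY_HAS_PUNCTUATION   = 1 << 1, /* at least one punctuation mark */
    TOY_LEADING_PUNC      = 1 << 2, /* first character is punctuation */
    TOY_TRAILING_PUNC     = 1 << 3  /* last character is punctuation */
};

/* Builds an attribute mask for a word by OR-ing one flag per condition. */
static unsigned ToyWordAttr(const char *word)
{
    size_t n = strlen(word);
    unsigned mask = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)word[i];

        if (c > 0x7F) {
            /* Assumption: the output encoding is 7-bit ASCII. */
            mask |= TOY_HAS_UNMAPPED_CHAR;
        } else if (ispunct(c)) {
            mask |= TOY_HAS_PUNCTUATION;
            if (i == 0)
                mask |= TOY_LEADING_PUNC;
            if (i == n - 1)
                mask |= TOY_TRAILING_PUNC;
        }
    }
    return mask;
}
```

A caller tests individual conditions with a bitwise AND, for example `ToyWordAttr("end.") & TOY_TRAILING_PUNC`. In a plug-in, the mask would come from PDWordGetAttr instead and would be tested the same way against the flags defined by the API.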
Topic                            Description                                     See
About searching for words        Describes searching for words.                  page 131
Creating a PDWordFinder object   Describes how to create a PDWordFinder object.  page 132
Extracting and displaying words  Describes how to extract and display words.     page 134
Highlighting words               Describes how to highlight words.               page 136