About the course
The PDF Automation in UiPath Studio course is for learners who already have some knowledge of RPA. It builds on what you know about UiPath Studio and focuses on PDF data extraction. Through video tutorials, you will learn about the types of PDF documents, how to install the UiPath PDF Activities package, and how to extract data from blocks of text and tables in PDFs using UiPath Studio.
What you will learn in this course
At the end of this course you should be able to:
- Install the UiPath PDF Activities Package.
- Use different activities in Studio to extract large pieces of text from the PDF files.
- Extract a single piece of information from a PDF document.
- Use different activities in Studio to extract data from multiple PDFs with similar structures.
- Use the UI automation capabilities of Studio to extract fluctuating values from multiple files with the same structure.
Change Adobe Acrobat Reader DC Settings
If the PDF is opened with Adobe Acrobat Reader DC, you may need to adjust a few settings before you can extract specific elements using UiPath Studio methods.
- Start Acrobat and press Ctrl+K. That opens the Preferences pop-up.
- Select Reading from the categories in the left panel.
- Check that the drop-down Reading Order option is set to the Acrobat recommended option, ‘Infer reading order from the document (recommended)’.
- Set ‘Page vs. Document’ to ‘Read the entire document’ and uncheck ‘Confirm before tagging documents’.
- Then on the left panel, click Accessibility. In the Other Accessibility Options section, check the first two boxes if they are not already checked: ‘Use document structure for tab order when no explicit tab order is specified’, ‘Enable assistive technology support’.
- Click OK.
Extracting Data from PDF
There are two types of PDF files: native and scanned.
Native PDF: a file that was originally generated on a computer (“born digital”), meaning that it was created from an electronic version of a document. One quick way to tell that a PDF is native is that you can select blocks of text in the file.
Scanned PDF: a file made up of scanned images of a document. With scanned PDFs, you cannot select text or use the search function because the PDF is a collection of images.
In PDF automation, data can be extracted using two separate activities:
- Extract data using the Read PDF Text activity.
- Extract data using the Read PDF with OCR activity.
Read PDF Text is the more accurate of the two but works only with native PDF documents. Read PDF with OCR is less reliable, but can extract text from scanned PDF documents. The second option also requires the selection of an OCR engine. Both activities are available in the UiPath.PDF.Activities package.
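The choice between the two activities can be sketched as a simple decision rule. The Python snippet below is only an illustration, not part of UiPath: the function name `choose_pdf_activity`, the `min_chars` threshold, and the idea of passing in already-extracted text-layer strings are all assumptions made for this example.

```python
def choose_pdf_activity(page_texts, min_chars=20):
    """Suggest which UiPath activity fits a PDF.

    page_texts: one string per page holding whatever selectable text
    the page exposes (hypothetical input; this sketch does not read
    PDF files itself).
    min_chars: assumed threshold below which the document is treated
    as a scanned, image-only PDF.
    """
    # Count the non-whitespace-padded characters found across all pages.
    text_chars = sum(len(text.strip()) for text in page_texts)
    if text_chars >= min_chars:
        # A usable text layer exists, so the PDF is treated as native.
        return "Read PDF Text"
    # No (or almost no) selectable text: treat as scanned.
    return "Read PDF with OCR"


if __name__ == "__main__":
    print(choose_pdf_activity(["Invoice No. 12345\nTotal: 99.00 EUR"]))
    print(choose_pdf_activity(["", "  "]))
```

In practice the threshold would need tuning, since some scanned PDFs carry a small amount of embedded text such as a producer stamp.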
Extracting data from PDF files
- Before we start working with PDF files, we should make sure that we have the UiPath PDF Activities Package installed.
- It’s very important to make the distinction between digital text and scanned text.
- Some of the options we have when working with digital text are: Read PDF Text, Get Full Text, Get Native Text.
- When working with scanned text we can use OCR-based activities.
- OCR activities require OCR Engines.
Extracting a Single Piece of Data From PDF
- The Get Text activity can be used to extract a single piece of digital text from a UI Element in a PDF file.
- We can add the activity from the Activities panel or by using the Recorder.
- If we want to iterate through multiple files and extract the same piece of information, we often need to fine-tune the selectors to make them less specific.
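Making a selector less specific usually means replacing the part that changes from file to file with a wildcard. The sketch below illustrates the idea in Python with `fnmatch`, whose `*` wildcard behaves similarly to the `*` wildcard in a selector attribute. The window titles and file names are hypothetical, and this is not how Studio evaluates selectors internally.

```python
from fnmatch import fnmatchcase

# Hypothetical window titles, one per opened invoice file.
WINDOW_TITLES = [
    "invoice_001.pdf - Adobe Acrobat Reader DC",
    "invoice_002.pdf - Adobe Acrobat Reader DC",
]

# Title attribute as it might be captured for the first file only.
SPECIFIC_TITLE = "invoice_001.pdf - Adobe Acrobat Reader DC"

# Same attribute with the changing part replaced by a wildcard.
RELAXED_TITLE = "invoice_*.pdf - Adobe Acrobat Reader DC"


def matching_titles(pattern, window_titles):
    """Return the titles that the (possibly wildcarded) pattern matches."""
    return [title for title in window_titles if fnmatchcase(title, pattern)]


if __name__ == "__main__":
    print(matching_titles(SPECIFIC_TITLE, WINDOW_TITLES))
    print(matching_titles(RELAXED_TITLE, WINDOW_TITLES))
```

The specific pattern matches only the file it was recorded on, while the relaxed one matches every file in the series, which is the behaviour we want when iterating through multiple PDFs.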
Extracting Data Using Anchor Base
The Anchor Base activity is used to identify an element with an unstable selector relative to an element with a stable one. For example, we may want to get a value from a PDF invoice. The UI element that holds the value has an unstable selector, while the label element next to it is stable.
This activity is made up of two blocks, as it performs an action in relation to another fixed element or anchor:
- The Anchor block: Supports only the Find Element or Find Image activities. Identifies the UI element to be used as an anchor.
- The Action block: Supports UI interaction activities for the target element. Most often for PDF, we will use the Get Text activity to retrieve the text in the target UI element.
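The anchor-then-action idea can be illustrated outside Studio with a small Python sketch. It assumes each UI element is reduced to a text plus an (x, y) position, finds the stable label first, and then reads the nearest element to its right on the same row. The `Element` type, the function names, and the `row_tolerance` value are all hypothetical; this shows the concept only and is not how the Anchor Base activity is implemented.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Simplified stand-in for a UI element on a PDF page."""
    text: str
    x: float
    y: float


def find_anchor(elements, label):
    """Anchor step: locate the element with the stable label text."""
    for element in elements:
        if element.text == label:
            return element
    return None


def get_text_right_of(elements, anchor, row_tolerance=5.0):
    """Action step: read the closest element to the right of the anchor."""
    candidates = [
        element for element in elements
        if element is not anchor
        and abs(element.y - anchor.y) <= row_tolerance  # same row, roughly
        and element.x > anchor.x                        # to the right
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda element: element.x - anchor.x).text


def extract_by_anchor(elements, label):
    """Combine both steps; return None when the anchor is missing."""
    anchor = find_anchor(elements, label)
    if anchor is None:
        return None
    return get_text_right_of(elements, anchor)


if __name__ == "__main__":
    invoice = [
        Element("Invoice Number", 50, 220),
        Element("INV-002", 200, 221),
        Element("Total", 50, 250),
        Element("120.50", 200, 250),
    ]
    print(extract_by_anchor(invoice, "Invoice Number"))
```

Because the value is located relative to the label, the lookup keeps working when the value moves to a different position in another file, as long as the label text stays the same.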
If the PDF activities are not listed in your Activities Panel, how can you get them?
Answer: By installing them using the Manage Packages feature.
What is the easiest way to get the invoice number from a native PDF file?
Answer: Open the PDF file with Adobe Acrobat Reader and scrape only the relevant information.
How can a robot read only the first page of a PDF file, using the PDF activities?
Answer: Set the Range property to: “1”
If you want to extract specific information from a series of PDF files with a similar structure but the workflow only works for one file of the series, what should you investigate?
Answer: The Selector property.
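To make the Range answer in the quiz above more concrete, the sketch below expands a range string into page numbers. It assumes the string forms "All", a single page such as "1", a span such as "7-12", and a comma-separated mix such as "2-5, 7, 15-End"; treat that list of forms as an assumption to check against the activity's documentation. The function `pages_in_range` is hypothetical and written only for illustration.

```python
def pages_in_range(range_text, page_count):
    """Expand a page-range string into a list of 1-based page numbers.

    Pages outside 1..page_count are silently skipped (a choice made
    for this sketch, not documented UiPath behaviour).
    """
    text = range_text.strip()
    if text.lower() == "all":
        return list(range(1, page_count + 1))

    pages = []
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            # "End" stands for the last page of the document.
            if end_text.strip().lower() == "end":
                end = page_count
            else:
                end = int(end_text)
        else:
            start = end = int(part)
        pages.extend(p for p in range(start, end + 1) if 1 <= p <= page_count)
    return pages


if __name__ == "__main__":
    print(pages_in_range("1", 10))
    print(pages_in_range("2-5, 7, 15-End", 16))
```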