PDF Source

Overview

The PDF Source Component is an SSIS Data Flow Component for consuming tabular data from PDF files. The component detects tables in the PDF file and allows processing a single table or multiple consecutive tables (across several pages), assuming they have the same structure. All output columns are of data type DT_WSTR.

Demonstration

Quick Start

Consume tabular data from PDF File

In this section we will show you how to set up a PDF Source component.

For this scenario, we will read data for some items from an invoice in PDF format.

In the SSIS Toolbox, locate COZYROC's PDF Source component and drag it onto the Data Flow canvas.
Double-click it to open its editor.
Choose the location of the PDF file and specify the parameters that describe its table structure and which table to process (as there are multiple tables).

When you click the Columns tab, the component prepares the output and external columns by analyzing the data in the PDF.

Click "Preview" to verify that the data is read correctly.

Optionally, specify the "Error Output" settings.
Click OK to close the component editor.

Congratulations! You have successfully configured the PDF Source component.

Parameters

General

Use the General page of the PDF Source dialog to specify the source PDF file, which table to process, and how to process it.

Connection

Select a file via a standard FILE connection manager.

Password

Specify the PDF file password, if necessary.

ContainsColumnNames

Specify whether the PDF table to be processed has a header row with column names.

MergeTables

Specify whether consecutive tables should be treated as one. This is useful for a table that spans several pages. Consecutive tables are "merged", i.e. processed as a single table, only if they have the same number of columns.
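
The merge rule can be illustrated with a minimal Python sketch. This is not the component's implementation; it assumes each detected table is a list of rows and each row is a list of string values, and the function name `merge_consecutive` is hypothetical.

```python
def merge_consecutive(tables):
    """Append each table to the previous one when their column counts match.

    Assumption: a table's column count is the length of its first row.
    """
    merged = []
    for table in tables:
        if not table:
            continue  # ignore empty tables in this sketch
        width = len(table[0])
        if merged and len(merged[-1][0]) == width:
            # Same structure as the previous table: treat as a continuation.
            merged[-1].extend(table)
        else:
            # Different structure: start a new logical table.
            merged.append(list(table))
    return merged


# Two 2-column tables on consecutive pages, followed by a 3-column table.
pages = [
    [["a", "b"], ["1", "2"]],
    [["3", "4"]],
    [["x", "y", "z"]],
]
result = merge_consecutive(pages)
```

Here the first two tables end up as one logical table, while the third stays separate because its column count differs.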

TableFindType

2.1 SR-1

Select how to locate a table in the PDF document. This property has the options listed in the following table.

TableFindType	Description
Index	Locate a table by its zero-based index (default strategy).
RowRegex	Locate a table by a regular expression on a row representation, where the row values are comma-separated.
IndexAndRowRegex	Locate a table by index and then locate its first row by a regular expression.

TableFind

2.1 SR-1

Specify the criteria for the PDF table location strategy, according to TableFindType.

TableFindType	TableFind
Index	A zero-based index (e.g. "0").
RowRegex	A regular expression to match the first row across all tables in the PDF document (e.g. `^#` would match the first row that starts with `#`).
IndexAndRowRegex	A table index and a regular expression to match the first row within the specified table in the PDF document (e.g. `1|^#` would match the first row in the second table that starts with `#`).
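
The three strategies can be sketched in Python as follows. This is an illustration under stated assumptions, not the component's code: tables are lists of rows of strings, a row is represented as its values joined by commas, the IndexAndRowRegex criteria has the form `index|regex`, and the function names `row_repr` and `locate` are hypothetical.

```python
import re


def row_repr(row):
    # Assumed row representation: cell values joined by commas.
    return ",".join(row)


def locate(tables, find_type, find):
    """Return (table_index, first_row_index), or None when nothing matches."""
    if find_type == "Index":
        idx = int(find)
        return (idx, 0) if 0 <= idx < len(tables) else None

    if find_type == "RowRegex":
        pattern = re.compile(find)
        # Search every row of every table; the first match wins.
        for t, table in enumerate(tables):
            for r, row in enumerate(table):
                if pattern.search(row_repr(row)):
                    return (t, r)
        return None

    if find_type == "IndexAndRowRegex":
        # Assumed criteria format: "<zero-based index>|<regex>".
        index_part, _, regex_part = find.partition("|")
        idx = int(index_part)
        if not 0 <= idx < len(tables):
            return None
        pattern = re.compile(regex_part)
        for r, row in enumerate(tables[idx]):
            if pattern.search(row_repr(row)):
                return (idx, r)
        return None

    raise ValueError(f"Unknown TableFindType: {find_type}")


tables = [
    [["Invoice", "123"]],
    [["note", "x"], ["#", "Item", "Price"], ["1", "Pen", "2.00"]],
]
by_index = locate(tables, "Index", "0")
by_regex = locate(tables, "RowRegex", "^#")
by_both = locate(tables, "IndexAndRowRegex", "1|^#")
```

With this sample data, both regex strategies select the row `#,Item,Price` in the second table as the first row to process.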

LastRowsToSkip

Specifies how many rows to skip at the end of the table. This is useful mainly when there are one or more summary rows at the end.
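
As a rough Python illustration of the effect (the function name `drop_last_rows` is hypothetical, and rows are assumed to be lists of strings):

```python
def drop_last_rows(rows, last_rows_to_skip):
    """Return the rows without the trailing `last_rows_to_skip` rows."""
    if last_rows_to_skip < 0:
        raise ValueError("last_rows_to_skip must not be negative")
    # Compute the end index explicitly: rows[:-0] would return an empty list.
    keep = max(len(rows) - last_rows_to_skip, 0)
    return rows[:keep]


invoice_rows = [
    ["1", "Pen", "2.00"],
    ["2", "Pad", "1.50"],
    ["Total", "", "3.50"],  # summary row at the end of the table
]
items = drop_last_rows(invoice_rows, 1)
```

Setting the value to 1 in this example leaves only the two item rows.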

SkipIncompleteRows

2.1 SR-1

Specifies whether to skip rows that have fewer values than the table has columns. Such rows sometimes indicate content that doesn't really belong to the table (for example, when part of the PDF has not been parsed precisely). The available options are:

Value	Description
None	Don't skip incomplete rows (pad with NULL values, instead).
Bottom Rows	Skips incomplete rows only at the bottom of the table (default).
All	Skips all incomplete rows.
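
The difference between the three values can be sketched in Python. This is only an illustration: the function name `skip_incomplete` is hypothetical, `None` stands in for NULL, and padding incomplete rows that are not at the bottom in "Bottom Rows" mode is an assumption, since the text above does not describe that case.

```python
def skip_incomplete(rows, column_count, mode="Bottom Rows"):
    """Apply a SkipIncompleteRows-style policy to a list of rows."""

    def is_complete(row):
        return len(row) >= column_count

    def pad(row):
        # Pad short rows with None (standing in for NULL).
        return row + [None] * (column_count - len(row))

    if mode == "None":
        return [pad(row) for row in rows]
    if mode == "All":
        return [row for row in rows if is_complete(row)]
    if mode == "Bottom Rows":
        end = len(rows)
        # Walk up from the bottom while the rows are incomplete.
        while end > 0 and not is_complete(rows[end - 1]):
            end -= 1
        # Assumption: incomplete rows above that point are kept and padded.
        return [pad(row) for row in rows[:end]]
    raise ValueError(f"Unknown mode: {mode}")


rows = [
    ["1", "Pen", "2.00"],
    ["2", "Ink"],           # incomplete row in the middle
    ["3", "Pad", "1.50"],
    ["Total"],              # incomplete rows at the bottom
    ["Page 1"],
]
kept_none = skip_incomplete(rows, 3, "None")
kept_bottom = skip_incomplete(rows, 3, "Bottom Rows")
kept_all = skip_incomplete(rows, 3, "All")
```

In this sample, "None" keeps all five rows (padded), "Bottom Rows" keeps the first three, and "All" keeps only the two complete rows.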

Knowledge Base

Where can I find the documentation for the PDF Source?

What's New

2.1 SR-1

New: A new parameter 'Skip incomplete data rows'.
New: Find table by index, regex or both (TableIndex is replaced by TableFindBy and TableFind).

2.1

New: Introduced component.
