COZYROC Parquet components, part of the COZYROC SSIS+ suite, are third-party plug-ins for Microsoft SSIS that make it easy to parse and generate Apache Parquet files. The toolkit is easy to use and follows the same guidelines and principles used by the standard out-of-the-box SSIS components.
The Apache Parquet integration package consists of a Parquet Source and a Parquet Destination component, which enable reading and generating Parquet files.
Parquet file schema
- Both the Source and the Destination components can deduce the data schema from the provided sample file. However, if for some reason a sample file is not available at the time the package is designed, the schema can be entered into the Destination editor in JSON format.
Each element of the schema is described by two mandatory attributes, `name` and `type`, and one optional attribute, `fields`. The `type` property can represent a primitive data type (possible types and their corresponding SSIS types are given in the table below) or a complex type: `struct`, meaning a nested object, or `list`, meaning a collection (which can also consist of struct-type objects). Only elements of a complex type, i.e. a `struct` or a `list`, have a `fields` attribute in which the description of the nested objects is stored.
For example, a Parquet file with an `id` property and a `cities` list of `struct` objects, each containing `name` and `country` properties, can be represented with a schema string like this:
[{"Name" : "cities.list","Type" : "list","Fields" : [{"Name" : "cities.list.element","Type" : "struct","Fields" : [{"Name" : "cities.list.element.country","Type" : "string"}, {"Name" : "cities.list.element.name","Type" : "string"}]}]}, {"Name" : "id","Type" : "int64"}]
Parquet Data Type | SSIS Data Type |
---|---|
unspecified | DT_WSTR |
boolean | DT_BOOL |
byte | DT_UI1 |
signedbyte | DT_I1 |
short | DT_I2 |
int16 | DT_I2 |
unsignedint16 | DT_UI2 |
int32 | DT_I4 |
int64 | DT_I8 |
int96 | DT_I8 |
bytearray | DT_BYTES |
string | DT_WSTR |
float | DT_R4 |
double | DT_R8 |
decimal | DT_DECIMAL |
timestamp | DT_DBTIMESTAMPOFFSET |
time | DT_DBTIME2 |
date | DT_DATE |
interval | DT_BYTES |
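The type mapping in the table can be captured as a simple lookup. The sketch below is plain Python for illustration only (`PARQUET_TO_SSIS` and `ssis_type` are invented names, not part of the component); unknown types fall back to DT_WSTR, mirroring the "unspecified" row.

```python
# Parquet-to-SSIS data type mapping from the table above (illustration only).
PARQUET_TO_SSIS = {
    "boolean": "DT_BOOL",
    "byte": "DT_UI1",
    "signedbyte": "DT_I1",
    "short": "DT_I2",
    "int16": "DT_I2",
    "unsignedint16": "DT_UI2",
    "int32": "DT_I4",
    "int64": "DT_I8",
    "int96": "DT_I8",
    "bytearray": "DT_BYTES",
    "string": "DT_WSTR",
    "float": "DT_R4",
    "double": "DT_R8",
    "decimal": "DT_DECIMAL",
    "timestamp": "DT_DBTIMESTAMPOFFSET",
    "time": "DT_DBTIME2",
    "date": "DT_DATE",
    "interval": "DT_BYTES",
}

def ssis_type(parquet_type):
    # Unspecified or unknown types fall back to DT_WSTR, as in the table.
    return PARQUET_TO_SSIS.get(parquet_type, "DT_WSTR")
```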
Parquet Source
In this section we will show you how to set up a Parquet Source component.
- Double-click on the component on the canvas.
- Once the component editor opens, select the file connection from the File Connection menu. If the file is in the correct format, its data schema will be visualized in the Schema text editor. Note that the schema can be entered manually in the editor, and the corresponding metadata will be generated even without a file currently available.
- When you click on the Columns tab, the component prepares the outputs and external columns by analyzing the data in the Schema text editor. Note that the Parquet Source can have multiple outputs (see the article about composite records), whose columns you can see on this tab. The data in these outputs can be processed by downstream transformation and destination components (e.g. multiple OLE DB Destinations can store the data in a SQL Server database).
- Click OK to close the component editor.
Congratulations! You have successfully configured the Parquet Source component.
Parquet Destination
In this section we will show you how to set up a Parquet Destination component.
- Double-click on the component on the canvas.
- Once the component editor opens, select the destination where the generated Parquet data will be stored. Then provide a Parquet sample file or write the schema directly into the Schema text editor. You can also change the size of the row groups into which the Parquet file will be divided internally.
- When you click on the Mapping tab, the component prepares the inputs and external columns by analyzing the schema in the Schema text editor. Note that the Parquet Destination can have multiple inputs (see the article about composite records), whose columns you can see on this tab. The data in these inputs can be supplied by upstream transformation and source components (e.g. a Query Transformation can be used to retrieve the necessary data from a SQL Server database).
- Click OK to close the component editor.
Congratulations! You have successfully configured the Parquet Destination component.
Overview
The Parquet Source Component is an SSIS Data Flow Component for retrieving data from an Apache Parquet file. It supports multiple outputs via the composite records pattern.
- Supports reading Apache Parquet files.
- Component metadata is automatically retrieved from the provided Parquet file.
- Supports the following Parquet sources: File and Variable.
- Supports composite outputs. Besides the root Parquet Source Output that contains the top-level fields, for any nested arrays, corresponding composite outputs get populated.
- Supports an error output for redirecting problematic records (in case of a failure processing the field values).
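The composite-outputs idea can be pictured as splitting each nested record into a root row set and a child row set linked by a key. The plain-Python sketch below is purely illustrative; the `_id` and `_parent_id` keys are invented stand-ins, not the component's actual linking columns.

```python
# Illustrative only: split nested records into a "root" output and a
# child output for the nested array, linked by invented _id/_parent_id keys.
def split_composite(records):
    root_rows, city_rows = [], []
    for parent_id, record in enumerate(records):
        root_rows.append({"_id": parent_id, "id": record["id"]})
        for city in record["cities"]:
            city_rows.append({"_parent_id": parent_id, **city})
    return root_rows, city_rows

records = [{"id": 1, "cities": [{"name": "Lyon", "country": "France"},
                                {"name": "Kyoto", "country": "Japan"}]}]
root, cities = split_composite(records)
# root holds the top-level fields; cities holds one flat row per array element.
```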
Quick Start
In this section we will show you how to set up a Parquet Source component.
- Double-click on the component on the canvas.
- Once the component editor opens, select the file connection from the File Connection menu. If the file is in the correct format, its data schema will be visualized in the Schema text editor. Note that the schema can be entered manually in the editor, and the corresponding metadata will be generated even without a file currently available.
- When you click on the Columns tab, the component prepares the outputs and external columns by analyzing the data in the Schema text editor. Note that the Parquet Source can have multiple outputs (see the article about composite records), whose columns you can see on this tab. The data in these outputs can be processed by downstream transformation and destination components (e.g. multiple OLE DB Destinations can store the data in a SQL Server database).
- Click OK to close the component editor.
Congratulations! You have successfully configured the Parquet Source component.
Parameters
Configuration
Use the parameters below to configure the component.
Indicates the source of Parquet data. The following options are available:
Value | Description |
---|---|
File | Select an existing File Connection Manager or create a new one. |
Variable | The Parquet data is available in a variable. Select a variable or create a new one. |
What's New
- New: Support for stream input.
- Fixed: Incorrect lower-case headers (Thank you, Romain).
- Fixed: Failed with error "System.IO.EndOfStreamException: Unable to read beyond the end of the stream." (Thank you, Naveen).
- New: Considerable performance improvements.
- Fixed: Data mismatch when reading from certain files (Thank you, Jessica).
- New: Introduced component.
Related documentation
Overview
The Parquet Destination Component is an SSIS Data Flow Component for generating Apache Parquet files.
- The component metadata is either automatically retrieved from a sample Parquet file or can be manually specified in JSON format.
- The generated Parquet file can contain nested arrays of objects following the composite records pattern, where the fields for the arrays are fed via separate inputs.
- The generated Parquet content can be written to a file or stored in a variable.
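Conceptually, the destination side inverts the source-side split: rows arriving on the root input and on the per-array inputs are reassembled into nested records. The plain-Python sketch below is illustrative only; the `_id` and `_parent_id` keys are invented stand-ins for the component's linking columns.

```python
# Illustrative only: rebuild nested records from a root input and a child
# input, the inverse of the source-side split. Key names are invented.
def assemble_composite(root_rows, city_rows):
    records = []
    for root in root_rows:
        cities = [{k: v for k, v in c.items() if k != "_parent_id"}
                  for c in city_rows if c["_parent_id"] == root["_id"]]
        records.append({"id": root["id"], "cities": cities})
    return records

root_rows = [{"_id": 0, "id": 1}]
city_rows = [{"_parent_id": 0, "name": "Lyon", "country": "France"}]
records = assemble_composite(root_rows, city_rows)
```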
Quick Start
In this section we will show you how to set up a Parquet Destination component.
- Double-click on the component on the canvas.
- Once the component editor opens, select the destination where the generated Parquet data will be stored. Then provide a Parquet sample file or write the schema directly into the Schema text editor. You can also change the size of the row groups into which the Parquet file will be divided internally.
- When you click on the Mapping tab, the component prepares the inputs and external columns by analyzing the schema in the Schema text editor. Note that the Parquet Destination can have multiple inputs (see the article about composite records), whose columns you can see on this tab. The data in these inputs can be supplied by upstream transformation and source components (e.g. a Query Transformation can be used to retrieve the necessary data from a SQL Server database).
- Click OK to close the component editor.
Congratulations! You have successfully configured the Parquet Destination component.
Parameters
Configuration
Use the parameters below to configure the component.
Indicates the destination of Parquet data. The following options are available:
Value | Description |
---|---|
File | The Parquet data will be stored in a file. Select an existing File Connection Manager or create a new one. |
Variable | The Parquet data will be stored in a variable. Select a variable or create a new one. |

Represents the maximum number of rows in a Parquet row group. A row group is a logical horizontal partitioning of the data into rows. It holds serialized (and compressed) arrays of column entries.
JSON string representing the schema of the Parquet file.
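As a rough illustration of how the row group size interacts with the row count: the number of row groups in the generated file is the total row count divided by the group size, rounded up. A plain-Python sketch (the function name is invented, not part of the component):

```python
import math

def row_group_count(total_rows, rows_per_group):
    # Each row group holds at most rows_per_group rows; the last may be partial.
    return math.ceil(total_rows / rows_per_group)

print(row_group_count(1_000_000, 100_000))  # 10 full row groups
print(row_group_count(250_000, 100_000))    # 3 groups: 100k + 100k + 50k
```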
What's New
- New: Support for stream output.
- New: Automatic schema generation when attaching to upstream component.
- New: Support for dynamic data flow.
- Fixed: Incorrect lower-case headers (Thank you, Romain).
- Fixed: Missing record in each batch of records (Thank you, Romain).
- New: Introduced component.
Related documentation
Knowledge Base
- Where can I find the documentation for the Parquet Source?
- Where can I find the documentation for the Parquet Destination?
COZYROC SSIS+ Components Suite is free for testing in your development environment.
A licensed version can be deployed on-premises, on Azure-SSIS IR and on COZYROC Cloud.