Saturday, February 20, 2016

S3 load errors in the Redshift (AWS) COPY command



We have faced a lot of weird issues while loading S3 bucket files into Redshift. I will try to explain all the issues we faced. Before going into them, let's see what a valid S3 file should contain:

  • The number of values in each row of the S3 file is exactly equal to the number of columns in the Redshift table
  • Each value in the S3 file is separated by a delimiter, in our case a pipe (|)
  • Each line in the S3 file maps to exactly one insert statement on Redshift
  • Empty values are passed in the S3 file for the corresponding optional fields in the table


To load S3 file content into a Redshift database, AWS provides the COPY command, which stores bulk or batch S3 data into Redshift.
Let's assume there is a table testMessage in Redshift which has three columns: id of integer type, name of varchar(10) type and msg of varchar(10) type.
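
For reference, the table could be created with a statement like the one below (a minimal sketch; the table and column names come from this post, and the types are as described above):

create table testMessage
(
    id   integer,      -- numeric id
    name varchar(10),  -- optional name, up to 10 characters
    msg  varchar(10)   -- message text, up to 10 characters
);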

The COPY command to load the S3 file into Redshift is below:

copy testMessage (id, name, msg) from 's3://blogpost.testbucket/test/file.txt' credentials 'aws_access_key_id=;aws_secret_access_key=;token=' delimiter '|' ACCEPTINVCHARS '_'

To insert values from an S3 file, a sample S3 file could be:

77|chandu|chanduthedev


This file has three values, equal to the number of columns in the testMessage table, and each value is separated by the pipe (|) symbol.

Let's see another sample S3 file:
88||chanduthedev

In this file, we have an empty value for the name column of the testMessage table. So far so good. Let's take some S3 files which cause the Redshift COPY command to fail:

99|chanduthedev

This S3 file contains only two values, 99 and chanduthedev; the missing third value causes the S3 load COPY command to fail.

99|ch
and u|chanduthedev

In this file, the second value is ch\nand u, which contains a new line (\n) character, so it becomes two rows in the S3 file, which means two insert statements for the Redshift COPY command. The first row becomes a two-value insert statement, which is invalid, and the second row is another invalid two-value statement.

For these invalid S3 files you may get the below error message:

Load into table 'testMessage' failed.  Check 'stl_load_errors' system table for details
and in AWS you may get the below error:
Delimiter not found 
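
To see exactly which line and column failed, you can query the stl_load_errors system table mentioned in the error message. A minimal query could look like this (stl_load_errors and these columns are standard Redshift system table fields):

select starttime, filename, line_number, colname, err_reason, raw_line
from stl_load_errors
order by starttime desc
limit 10;

The err_reason column shows messages like 'Delimiter not found', and raw_line shows the offending input row.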

Let's take another failing S3 file, which has the delimiter as part of the value for the name column:

77|chan|234|chanduthedev


In the above S3 file, there appear to be four values because of the extra pipe (|) character in the name chan|234, which causes the Redshift COPY command to treat the S3 file as having four values, while the table has only three columns.

For S3 load failures, the most common reason is special characters or escape characters like new line (\n), double quotes ("), single quotes, etc. You need to either escape those special characters or remove them.

We initially followed the latter idea of removing special characters while processing, before storing into Redshift. But later we came to know that we can use the ESCAPE keyword in the COPY command:


copy testMessage (id, name, msg) from 's3://blogpost.testbucket/test/file.txt' credentials 'aws_access_key_id=;aws_secret_access_key=;token=' delimiter '|' ACCEPTINVCHARS '_' ESCAPE

Adding ESCAPE to the COPY command will solve a lot of these issues. So always check for ESCAPE.
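
With ESCAPE, a backslash (\) in the input data tells COPY to load the character that follows it as part of the column value, even if it is the delimiter or a new line. Assuming the files are prepared with backslashes before the embedded pipe and new line, the earlier failing samples could be rewritten as:

77|chan\|234|chanduthedev

99|ch\
and u|chanduthedev

so that chan|234 loads as a single name value and ch\nand u loads as one row instead of two.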

Happy Debugging...

