We have faced a lot of weird issues while loading S3 bucket files into Redshift. I will try to explain all the issues we faced. Before going into those, let's see what a valid S3 file should contain:
- The number of values in each line of the S3 file is exactly equal to the number of columns in the Redshift table
- Each value is separated by a delimiter, in our case the pipe (|)
- Each line in the S3 file is exactly one insert statement (one row) into Redshift
- Empty values are passed in the S3 file for the corresponding optional fields in the table
To store S3 file content in a Redshift database, AWS provides the COPY command, which loads S3 data into Redshift in bulk.
Let's assume there is a table testMessage in Redshift which has three columns: id of integer type, name of varchar(10) type, and msg of varchar(10) type.
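For reference, a minimal sketch of that table definition (only the column names and types come from the example above; everything else is assumed):
create table testMessage (id integer, name varchar(10), msg varchar(10));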
The COPY command to load the S3 file into Redshift looks like this:
copy testMessage (id, name, msg) from 's3://blogpost.testbucket/test/file.txt' credentials 'aws_access_key_id=;aws_secret_access_key=;token=' delimiter '|' ACCEPTINVCHARS '_'
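As a side note, COPY also accepts an IAM role instead of embedded keys; a hedged sketch, where the role ARN is a placeholder you would replace with your own:
copy testMessage (id, name, msg) from 's3://blogpost.testbucket/test/file.txt' iam_role 'arn:aws:iam::<account-id>:role/<redshift-load-role>' delimiter '|' ACCEPTINVCHARS '_'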
To insert values from an S3 file, a sample S3 file could be:
77|chandu|chanduthedev
This file has three values in total, which is equal to the number of columns in the testMessage table, and each value is separated by the pipe (|) symbol.
Let's see another sample S3 file:
88||chanduthedev
In this file, we have one empty value for the name column of testMessage in Redshift. So far so good. Now let's take some S3 files which cause the Redshift COPY command to fail:
99|chanduthedev
This S3 file contains only two values, 99 and chanduthedev; the missing third value causes the S3 load COPY command to fail. Here is another failing file:
99|ch
and u|chanduthedev
In this file, the second value is ch\nand u, which contains a newline (\n) character, so it becomes two lines in the S3 file, which means two insert statements for the Redshift COPY command. The first line becomes an invalid two-value insert, and the second line is another invalid two-value insert.
For these invalid S3 files you may get the below error message:
Load into table 'testMessage' failed. Check 'stl_load_errors' system table for details
and in AWS you may get the below error:
Delimiter not found
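To see the details the first message points to, you can query the stl_load_errors system table; a minimal sketch (the column list is trimmed to the most useful fields):
select query, filename, line_number, colname, err_reason, raw_line from stl_load_errors order by starttime desc limit 5;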
Let's take another failing S3 file, which has the delimiter as part of the value for the name column:
77|chan|234|chanduthedev
The above S3 file looks like it has four values because of the extra pipe (|) character in the name chan|234, which causes the Redshift COPY command to treat the line as having four values while the table has only three columns.
For S3 load failures, the most common reason is special characters or escape characters like newlines (\n), double quotes ("), single quotes, and so on. You either need to escape those special characters or remove them.
We followed the latter idea of removing special characters while processing and storing into Redshift, but later came to know that we can use the ESCAPE keyword in the COPY command:
copy testMessage (id, name, msg) from 's3://blogpost.testbucket/test/file.txt' credentials 'aws_access_key_id=;aws_secret_access_key=;token=' delimiter '|' ACCEPTINVCHARS '_' ESCAPE
Adding ESCAPE to the COPY command will solve a lot of these issues, so always check for ESCAPE. The sketch below shows how the failing lines above could be escaped.
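With ESCAPE, a backslash in the input data tells COPY to load the character that follows it as part of the current column value, even if it is the delimiter or a newline. Assuming the same testMessage table, the two failing samples above could be rewritten in the file as:
77|chan\|234|chanduthedev
99|ch\
and u|chanduthedev
The first line now loads chan|234 into name, and the backslash before the line break keeps ch\nand u together as a single value.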
Happy Debugging...