Friday, February 19, 2016

String length exceeds DDL length - S3 bucket Load error on redshift(AWS)

We recently hit a tricky issue in the AWS cloud while loading an S3 file into Redshift using Python. It took almost a whole day to identify the issue and fix it.

Our way of doing things in the AWS cloud is as below
  1. Get the json message from SQS using python
  2. Validate the fields of the json message received in step 1
  3. After validating, build a csv format file using the python csv library and store it in an S3 bucket
  4. Load the S3 file into the AWS Redshift database using the copy command

In simple terms, the above process reads the message, processes it and inserts it into the Redshift database. In this process there is a chance of failures like
  1. While validating the json message from SQS, we may get invalid input which python can't identify, like escape characters - we faced this too and I will make another blog post on it soon
  2. While writing the S3 file we may get some extra escape characters - this is the problem we faced and the one I am covering now
  3. While inserting into the Redshift db
    1. if we try to insert a value of an invalid datatype into a column (for example a varchar value into an integer column)
    2. if the inserted value exceeds the length of the column (for example the msg column length is 10 and we try to insert more than 10 chars) - in our case this was caused by point 2 above
    3. if the S3 file does not have proper delimiters
To simplify the big problem, I am assuming there is one table testMessage in Redshift which has two columns: id of integer type and msg of varchar(10) type.
To insert values into the testMessage table using the above process, we expect a json message which contains id and msg keys. A sample message is shown below

{"id":7, "msg":"testfile"}
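The validation in step 2 could look roughly like the sketch below for this sample table. This is only an illustration and not our production code: the function name validate_message and the constant MSG_MAX_BYTES are made up for this post, and the check counts UTF-8 bytes because Redshift measures varchar length in bytes.

```python
import json

MSG_MAX_BYTES = 10  # msg is varchar(10) in the example table (hypothetical constant)


def validate_message(raw):
    """Parse a json message and check it against the testMessage columns.

    Returns (id, msg) or raises ValueError. Hypothetical helper for illustration.
    """
    record = json.loads(raw)  # invalid json raises a ValueError subclass
    msg_id = record.get("id")
    msg = record.get("msg")
    # bool is a subclass of int in python, so reject it explicitly
    if not isinstance(msg_id, int) or isinstance(msg_id, bool):
        raise ValueError("id must be an integer")
    if not isinstance(msg, str):
        raise ValueError("msg must be a string")
    if len(msg.encode("utf-8")) > MSG_MAX_BYTES:
        raise ValueError("msg is longer than %d bytes" % MSG_MAX_BYTES)
    return msg_id, msg


print(validate_message('{"id":7, "msg":"testfile"}'))      # (7, 'testfile')
print(validate_message(r'{"id":7, "msg":"\"testfile\""}'))  # (7, '"testfile"')
```

Note that a msg value wrapped in double quotes also passes, because after json parsing it is exactly 10 characters long.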

As per the above four step process
  1. json message from SQS
    • {"id":7, "msg":"testfile"}
  2. Validating key - values in the json message
    • looks valid as id field contains integer and msg field contains string less than or equal to 10 chars
  3. CSV format S3 file
    • 7|testfile
  4. Loading to redshift
    • copy testMessage (id, msg) from 's3://blogpost.testbucket/test/file.txt' credentials 'aws_access_key_id=;aws_secret_access_key=;token=' delimiter '|' ACCEPTINVCHARS '_' ESCAPE
This will work fine as there are no validation failures and no special characters in the message. Let's take another message which contains special characters like double quotes

{"id":7, "msg":"\"testfile\""}


  1. json message from SQS
    • {"id":7, "msg":"\"testfile\""}
  2. Validating key - values in the json message
    • looks valid as id field contains integer and msg field contains string less than or equal to 10 chars
  3. CSV format s3 file
    • 7|""testfile""
  4. Loading to redshift
    • copy testMessage (id, msg) from 's3://blogpost.testbucket/test/file.txt' credentials 'aws_access_key_id=;aws_secret_access_key=;token=' delimiter '|' ACCEPTINVCHARS '_' ESCAPE

Boom ....
The load fails: the Redshift copy command returns the error message 'String length exceeds DDL length'.
Here comes the actual problem we faced!

If we look at the S3 file in step 3, the actual msg value is '"testfile"', whose length is 10, but the S3 file has two extra double quotes, which takes the length of the string from 10 to 12.
Where have all these extra double quotes come from?
Through validation and message processing everything looks fine, but while writing the S3 file in CSV format, the python csv library adds extra escape characters, and these cause the actual problem.

In our case the value contains double quotes, which are a special character for the csv library, so it adds another double quote as an escape character. That increases the length from 10 to 12, which no longer fits in varchar(10).
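The growth is easy to reproduce without S3 or Redshift. The sketch below writes the same row into an in-memory buffer with the csv writer's default quoting; write_row is a made-up helper and the '|' delimiter is taken from the copy command. The exact count depends on the writer settings (with the Python 3 defaults the writer also wraps the field in an enclosing pair of quotes, so the count comes out higher than the 12 we saw), but the effect is the same: the file holds more characters than the validator checked.

```python
import csv
import io


def write_row(row, **fmtparams):
    """Write one row with csv.writer and return the line that would land in the file."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="|", lineterminator="\n", **fmtparams).writerow(row)
    return buf.getvalue().rstrip("\n")


msg = '"testfile"'
line = write_row([7, msg])        # default quoting, doublequote=True
field = line.split("|", 1)[1]     # the msg column as stored in the file

print(len(msg))                   # 10, the length validation saw
print(field, len(field))          # the doubled quotes make the field longer than 10
```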

To avoid this problem, we can use

csv.register_dialect(dialect, doublequote=False, escapechar='\\', quoting=csv.QUOTE_NONE)

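This turns double quoting off and makes the csv writer escape special characters with a backslash instead. The copy command above already has the ESCAPE option, so Redshift treats the backslash as an escape character and drops it while loading, and the value stored in msg is back to 10 characters.

  1. escapechar='\\' sets a single backslash as the escape character (it is written as '\\' because the backslash itself has to be escaped in a python string).
  2. escapechar='' does not mean "no escaping is needed"; it leaves the writer without any escape character to use.

Use point 1 and not point 2. If you use point 2 you will get the below error, which means the writer is expecting an escape character but you have not specified one (newer python versions may reject an empty escapechar outright)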

Error: need to escape, but no escapechar set
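Both the fix and the error can be tried locally with a small sketch like the one below. The dialect name "redshift_pipe" is made up for this post, and the delimiter and lineterminator settings are my additions so that the example is self-contained.

```python
import csv
import io

# Hypothetical dialect name; the three key settings are the ones from the fix above.
csv.register_dialect("redshift_pipe", delimiter="|", lineterminator="\n",
                     doublequote=False, escapechar="\\", quoting=csv.QUOTE_NONE)

buf = io.StringIO()
csv.writer(buf, dialect="redshift_pipe").writerow([7, '"testfile"'])
escaped_line = buf.getvalue()
print(escaped_line)   # 7|\"testfile\"  (each quote is prefixed with a backslash)

# Same settings but with no escapechar at all: the writer cannot escape the quotes.
error_text = None
try:
    csv.writer(io.StringIO(), delimiter="|",
               quoting=csv.QUOTE_NONE).writerow([7, '"testfile"'])
except csv.Error as exc:
    error_text = str(exc)
print(error_text)     # need to escape, but no escapechar set
```

The line in the file is still longer than 10 characters, but that is fine here because the backslashes are removed by the ESCAPE option during the load.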

This fix looks simple in this example, but in our real scenario the table has almost 40 columns and the SQS json message was more than 7000 (7K) characters with a lot of special characters like newline (\n), carriage return (\r), backslashes (\\), double quotes (") etc. We worked a lot to identify this issue and are sharing it here to help others.
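For messy values like these, a quick local check is to write the row with the escaping settings and then undo the escaping the way the ESCAPE option is documented to behave (drop each backslash and keep the character after it). The sketch below does that; to_escaped_line and simulate_escape_load are made-up helpers, and the second one is only a rough stand-in for Redshift, not its real loader.

```python
import csv
import io
import re


def to_escaped_line(row):
    """Write one row with the backslash-escaping settings and return the raw line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="|", doublequote=False,
                        escapechar="\\", quoting=csv.QUOTE_NONE)
    writer.writerow(row)
    return buf.getvalue()   # ends with the default "\r\n" line terminator


def simulate_escape_load(field):
    """Rough stand-in for COPY ... ESCAPE: drop each backslash, keep the next char."""
    return re.sub(r"\\(.)", r"\1", field, flags=re.DOTALL)


value = 'line1\nline2\r "quoted" back\\slash a|b'
line = to_escaped_line([7, value])
field = line[len("7|"):-len("\r\n")]   # the msg column as stored in the file
restored = simulate_escape_load(field)

print(len(value), len(field))   # the file copy is longer because of the backslashes
print(restored == value)        # True: nothing is lost after un-escaping
```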


Hope this helps ...
Happy Debugging ......




