We have faced a lot of weird issues while loading S3 bucket files into Redshift. I will try to explain all the issues we faced. Before going into those, let's see what a valid S3 file should contain:
- The number of values in each line of the S3 file is exactly equal to the number of columns in the Redshift table
- Each value is separated by a delimiter, in our case the pipe (|)
- Each line in the S3 file is exactly one insert statement (one row) into Redshift
- Empty values are passed in the S3 file for the corresponding optional fields in the table
To store S3 file content in a Redshift database, AWS provides the COPY command, which loads S3 data into Redshift in bulk.
Let's assume there is a table testMessage in Redshift which has three columns: id of integer type, name of varchar(10) type, and msg of varchar(10) type.
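For reference, a minimal sketch of that table definition (only the column names and types come from the example above; everything else is assumed):
create table testMessage (id integer, name varchar(10), msg varchar(10));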
The COPY command to load the S3 file into Redshift looks like this:
copy testMessage (id, name, msg) from 's3://blogpost.testbucket/test/file.txt' credentials 'aws_access_key_id=;aws_secret_access_key=;token=' delimiter '|' ACCEPTINVCHARS '_'
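As a side note, COPY also accepts an IAM role instead of embedded keys; a hedged sketch, where the role ARN is a placeholder you would replace with your own:
copy testMessage (id, name, msg) from 's3://blogpost.testbucket/test/file.txt' iam_role 'arn:aws:iam::<account-id>:role/<redshift-load-role>' delimiter '|' ACCEPTINVCHARS '_'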
To insert values from an S3 file, a sample S3 file could be:
77|chandu|chanduthedev
This file has three values in total, which is equal to the number of columns in the testMessage table, and each value is separated by the pipe (|) symbol.
Let's see another sample S3 file:
88||chanduthedev
In this file, we have one empty value for the name column of testMessage in Redshift. So far so good. Now let's take some S3 files which cause the Redshift COPY command to fail:
99|chanduthedev
This S3 file contains only two values, 99 and chanduthedev; the missing third value causes the S3 load COPY command to fail. Here is another failing file:
99|ch
and u|chanduthedev
In this file, the second value is ch\nand u, which contains a newline (\n) character, so it becomes two lines in the S3 file, which means two insert statements for the Redshift COPY command. The first line becomes an invalid two-value insert, and the second line is another invalid two-value insert.
For these invalid S3 files you may get the below error message:
Load into table 'testMessage' failed. Check 'stl_load_errors' system table for details
and in AWS you may get the below error:
Delimiter not found
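To see the details the first message points to, you can query the stl_load_errors system table; a minimal sketch (the column list is trimmed to the most useful fields):
select query, filename, line_number, colname, err_reason, raw_line from stl_load_errors order by starttime desc limit 5;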
Let's take another failing S3 file, which has the delimiter as part of the value for the name column:
77|chan|234|chanduthedev
The above S3 file looks like it has four values because of the extra pipe (|) character in the name chan|234, which causes the Redshift COPY command to treat the line as having four values while the table has only three columns.
For S3 load failures, the most common reason is special characters or escape characters like newlines (\n), double quotes ("), single quotes, and so on. You either need to escape those special characters or remove them.
We followed the latter idea of removing special characters while processing and storing into Redshift, but later came to know that we can use the ESCAPE keyword in the COPY command:
copy testMessage (id, name, msg) from 's3://blogpost.testbucket/test/file.txt' credentials 'aws_access_key_id=;aws_secret_access_key=;token=' delimiter '|' ACCEPTINVCHARS '_' ESCAPE
Adding ESCAPE to the COPY command will solve a lot of these issues, so always check for ESCAPE. The sketch below shows how the failing lines above could be escaped.
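With ESCAPE, a backslash in the input data tells COPY to load the character that follows it as part of the current column value, even if it is the delimiter or a newline. Assuming the same testMessage table, the two failing samples above could be rewritten in the file as:
77|chan\|234|chanduthedev
99|ch\
and u|chanduthedev
The first line now loads chan|234 into name, and the backslash before the line break keeps ch\nand u together as a single value.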
Happy Debugging...