Monday, September 2, 2019

How to validate that a data file is formatted properly before training an ML model!!


In machine learning (ML), clean training data is the key to a better model.
You must remove noise (unwanted data such as special characters) from the input data to get clean data.
One of the first steps is to make sure the data format is consistent across the entire input file.
One of the best ways to check this is the command below. This command should return
only one value; if it returns more than one, your data file is not properly formatted.

cat file_filename | awk -F',' '{print NF}' | sort -u

Let's look at this command in detail. It is made up of the following parts:

cat: the command that displays the contents of a text file.
awk: a very powerful text-processing tool.
sort: sorts its input; the -u option (unique) removes duplicate lines.
NF: the number of fields in the current line. This is a built-in awk variable.
Pipe (|): a very powerful shell feature that chains commands together by sending the output of one command to the input of the next.

Let's take a sample file that contains the name, age and city of some customers, separated by commas. Each line should therefore contain exactly three comma-separated fields. In the data below, however, the second line contains four. This is a very common scenario in machine learning training data, and checking the format by eye in a large file is not easy. Using the above command, we can verify it easily.

Chandra, 35,Singapore
Sekhar,Chandra, 26,Guntur
Sekhar,35,Singapore

Applying the above command to this text returns two values, 3 and 4, because the input data is not consistent: the second line contains one extra comma.
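To try this yourself, here is a small sketch that recreates the sample and runs the same check. The temporary file and the variable names sample_file and field_counts are only for illustration, and awk reads the file directly here instead of going through cat.

```shell
# Recreate the three-line sample in a temporary file (illustrative name).
sample_file="$(mktemp)"
printf '%s\n' \
    'Chandra, 35,Singapore' \
    'Sekhar,Chandra, 26,Guntur' \
    'Sekhar,35,Singapore' > "$sample_file"

# Count the comma-separated fields on each line, then keep the distinct counts.
field_counts="$(awk -F',' '{print NF}' "$sample_file" | sort -u)"
echo "$field_counts"
```

With this sample the command prints two lines, 3 and 4, which tells us that the file is not consistent.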


After removing the extra comma from the second line, the command returns only one value, which is 3. The value itself depends on the number of fields per line in your file, but every line should contain the same number of fields, which is what makes the command return a single value.
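The same idea can be turned into a simple pass/fail check, for example before starting a training job. This is only a sketch: the corrected second row, the temporary file and the variable names are assumptions made for illustration.

```shell
# Corrected sample: the stray comma in the second line is removed
# (this exact corrected row is an assumption made for illustration).
clean_file="$(mktemp)"
printf '%s\n' \
    'Chandra, 35,Singapore' \
    'Sekhar Chandra, 26,Guntur' \
    'Sekhar,35,Singapore' > "$clean_file"

# Number of distinct field counts in the file; 1 means every line agrees.
# tr strips the padding that some versions of wc add around the number.
distinct_counts="$(awk -F',' '{print NF}' "$clean_file" | sort -u | wc -l | tr -d '[:space:]')"

if [ "$distinct_counts" -eq 1 ]; then
    status="consistent"
else
    status="inconsistent"
fi
echo "$status"
```

For the corrected sample this prints consistent. In a real script you would probably exit with a non-zero status in the else branch so that the next step does not run on a badly formatted file.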



The cat command prints the contents of the file, and the pipe sends that output to awk. awk splits each line on commas, because -F sets the field separator, and NF gives the number of fields in each line after the split. In this case we have three lines: the first and third have three fields, and the second has four. sort orders these counts, and -u keeps only the unique values, so duplicates are removed.
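Once the check reports more than one value, the next question is usually which lines are wrong. A small variation, sketched below, uses awk's built-in NR (the current line number) to print only the lines whose field count differs from the expected one. The variable names and the expected count of 3 are assumptions that match the sample above.

```shell
# Recreate the inconsistent sample in a temporary file (illustrative name).
sample_file="$(mktemp)"
printf '%s\n' \
    'Chandra, 35,Singapore' \
    'Sekhar,Chandra, 26,Guntur' \
    'Sekhar,35,Singapore' > "$sample_file"

# Expected number of fields per line for this sample (name, age, city).
expected_fields=3

# Print the line number and content of every line with a different field count.
bad_lines="$(awk -F',' -v n="$expected_fields" 'NF != n {print NR ": " $0}' "$sample_file")"
echo "$bad_lines"
```

For the sample data this prints only the second line, prefixed with its line number, so you know exactly where to fix the file.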

This command works efficiently on very large files as well. I tried it on data files of around one million lines and it returned results in seconds.


Happy Learning!!!
