Removing unwanted commas in a CSV file with PHP

During the past week, I've been working on a small PHP application that a friend paid me to write, called SearchCSV2MySQL. He is exporting data from a specific program and saving the data in Excel as *.csv. Here is what a sample of the data looks like:

130072690978,Jan-31 09:09,4.95,$,1,Vintage McMurdo SILVER Television Pre-Amplifier
220073351918,Jan-25 19:48,"1,031.00",$,2,"PITNEY BOWES, TABLETOP INSERTING MAILING SYSTEM"

As you might have guessed, you're looking at eBay auction information. The fields in the exported data are item, date, price, currency, bid count, and description. Importing large amounts of CSV data into a MySQL database is one thing (and I'll write a follow up post detailing how the application works), but I also needed to remove unwanted fields before importing the data into MySQL.

To do this, I break up each line using $fields = explode(",", $lines[$y]);, where $y is the current line we're processing. This takes the data between the commas and puts it into the $fields array. However if you look at the sample exported data carefully, you may realize that we won't get the results we're expecting. The commas found between the double quotes, in the price and description fields, will be processed as field separators! This would cause $fields[2] (the price field) for the second line to contain "1" instead of "1,031.00".

So how did I resolve this? After looking over the syntax for strpos(), substr(), and substr_replace() a million times, I finally came up with this solution:

/*
* ***************** Remove all commas from the price field ******************
*/
/* Isolate the second field (price field) by finding the position
* of the comma right before the price field (second comma)
*/
$first_comma_pos = strpos($lines[$y], ",");
$second_comma_pos = strpos($lines[$y], ",", $first_comma_pos + 1);

/* Check if the price field contains double quotes,
* which would mean the price has a comma in it and we need to remove it
*/
if(substr($lines[$y], $second_comma_pos + 1, 1) == """){
/* Find the positions of the opening and closing double quotes around the price */
$price_quotes_pos_start = strpos($lines[$y], """);
$price_quotes_pos_end = strpos($lines[$y], """, $price_quotes_pos_start + 1);

/* Find all occurences of a comma after the opening double quote, but before the closing quote,
* around the price field and remove them from this line.
*/
$price_comma_pos = strpos($lines[$y], ",", $price_quotes_pos_start);
while($price_comma_pos < $price_quotes_pos_end){
$lines[$y] = substr_replace($lines[$y], "", $price_comma_pos, 1);
$price_comma_pos = strpos($lines[$y], ",", $price_quotes_pos_start);

/* Update the position of $price_quotes_pos_end,
* since it has changed after we removed a comma!
*/
$price_quotes_pos_end = strpos($lines[$y], """, $price_quotes_pos_start + 1);
}
}

/*
* ***************** Remove all commas from the description field ******************
*/
/* Find the position of the comma right before the description field (fourth comma) */
$first_comma_pos = strpos($lines[$y], ",");
$second_comma_pos = strpos($lines[$y], ",", $first_comma_pos + 1);
$third_comma_pos = strpos($lines[$y], ",", $second_comma_pos + 1);
$fourth_comma_pos = strpos($lines[$y], ",", $third_comma_pos + 1);
$fifth_comma_pos = strpos($lines[$y], ",", $fourth_comma_pos + 1);

/* Check if the description field contains double quotes,
* which would mean the description has a comma in it and we need to remove it
*/
if(substr($lines[$y], $fifth_comma_pos + 1, 1) == """){
/* Find the positions of the opening and closing double quotes around the description */
$desc_quotes_pos_start = strpos($lines[$y], """, $fifth_comma_pos);
$desc_quotes_pos_end = strpos($lines[$y], """, $desc_quotes_pos_start + 1);

/* Find all occurences of a comma after the opening double quote, but before the closing quote,
* around the description field and remove them from this line.
* Since this is the last field, we dont need to worry about finding any
* commas after the closing quote, and therefore don't need to update $desc_quotes_pos_end.
*/
while($desc_comma_pos = strpos($lines[$y], ",", $desc_quotes_pos_start)){
$lines[$y] = substr_replace($lines[$y], "", $desc_comma_pos, 1);
}
}

I realize that there are a couple of limitations to this code, such as not removing both commas if the price contains a larger number (i.e., "1,042,240.00"). It also doesn't look in the description field for commas. UPDATE: I've updated the code to remove all commas from both the price and description fields.

This code is simply a proof-of-concept to show how I solved a problem. If you know of a better way to go about this, please let me know! Hopefully posting this snippet will save someone the time I spent figuring it out.

Write a Comment

Comment

  1. it is really a nice onew..thnkx for sharing this code, will you please update along with the complete code. thnkx in advance