I currently work with some numeric methods that accept large quantities of data as inputs. Normally the input comes from a database. As I was writing the tests, I needed a suitable external format to represent relatively large quantities of data. CSV seemed like a natural choice, because it is widely used and supported by SQL Management Studio and Excel. However, I found CSV very inconvenient, for the following reasons:
– There is no standard way to export a data table to CSV. CSV files can be read by the OLE DB Jet driver, but apparently this driver cannot create new tables/files. Google is full of samples on how to write CSV files by hand.
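For what it's worth, reading an existing CSV through the Jet text driver looks roughly like the sketch below: the folder plays the role of the database and the file name that of the table. The folder path, file name and HDR/FMT settings here are just illustrative assumptions, not anything mandated.

```csharp
// Rough sketch (not production code): reading an existing CSV into a DataTable
// through the Jet text driver. The folder and file name are hypothetical.
using System.Data;
using System.Data.OleDb;

static class CsvReadSketch
{
    public static DataTable ReadCsv(string folder, string fileName)
    {
        string connectionString =
            "Provider=Microsoft.Jet.OLEDB.4.0;" +
            "Data Source=" + folder + ";" +
            "Extended Properties=\"text;HDR=Yes;FMT=Delimited\"";

        using (var connection = new OleDbConnection(connectionString))
        using (var adapter = new OleDbDataAdapter("SELECT * FROM [" + fileName + "]", connection))
        {
            var table = new DataTable();
            adapter.Fill(table);   // column types are whatever the driver guessed
            return table;
        }
    }
}
```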
– There is no standard data format. Should the values be enclosed in quotes? How do you handle strings with embedded commas? What is the date format? Apparently, there is no such thing as a “CSV standard” and everyone does what they please.
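In the absence of a standard, the best one can do is pick a convention and stick to it. A minimal hand-rolled writer might look like the sketch below; quoting every field, doubling embedded quotes and writing dates as ISO 8601 are arbitrary choices made for illustration, not something the format requires.

```csharp
// Sketch only: one arbitrary set of CSV conventions: quote every field,
// double embedded quotes, write dates as ISO 8601, write NULL as an empty field.
using System;
using System.Data;
using System.Globalization;
using System.IO;
using System.Linq;

static class CsvWriteSketch
{
    static string Escape(object value)
    {
        if (value == null || value == DBNull.Value)
            return "";                                        // empty field for NULL
        string text = value is DateTime
            ? ((DateTime)value).ToString("yyyy-MM-ddTHH:mm:ss", CultureInfo.InvariantCulture)
            : Convert.ToString(value, CultureInfo.InvariantCulture);
        return "\"" + text.Replace("\"", "\"\"") + "\"";      // quote, double embedded quotes
    }

    public static void Write(DataTable table, string path)
    {
        using (var writer = new StreamWriter(path))
        {
            writer.WriteLine(string.Join(",",
                table.Columns.Cast<DataColumn>().Select(c => Escape(c.ColumnName)).ToArray()));
            foreach (DataRow row in table.Rows)
                writer.WriteLine(string.Join(",", row.ItemArray.Select(Escape).ToArray()));
        }
    }
}
```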
– There is no type information. The OLE DB driver can use schema.ini, but it is non-standard and cumbersome to work with. Also, it combines type information for multiple tables in a single file. Without explicit type information, the OLE DB driver tries to “guess” the types, but often guesses wrong. It seems to look at the first couple of dozen rows and base its decision on that. If later rows are incompatible, they suffer: the driver will either complain (“cannot convert string to integer”), or silently truncate doubles to ints, which is even worse.
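For completeness, a schema.ini entry for a hypothetical measurements.csv would look something like the fragment below. The file has to sit in the same directory as the CSV, and entries for every table in that directory go into this one shared file, which is part of what makes it cumbersome.

```ini
[measurements.csv]
ColNameHeader=True
Format=CSVDelimited
Col1=Id Long
Col2=Value Double
Col3=MeasuredOn DateTime
Col4=Comment Text
```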
– There is no null value. Empty strings are treated as NULLs on read, even if enclosed in quotes. SQL Management Studio, however, writes the literal NULL into the file when exporting NULL values, and it is read back as a string.
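Again, about the best one can do is pick a convention. The hypothetical helper below maps both an empty field and the literal NULL to DBNull on read; note that even then a genuine empty string can no longer be told apart from a NULL.

```csharp
// Hypothetical helper, assuming the writing side used empty fields for NULL
// and that SQL Management Studio exports may contain the literal NULL.
using System;
using System.Globalization;

static class CsvNullSketch
{
    public static object FromCsvField(string field, Type columnType)
    {
        if (string.IsNullOrEmpty(field) || field == "NULL")
            return DBNull.Value;             // a genuine empty string is lost here
        return Convert.ChangeType(field, columnType, CultureInfo.InvariantCulture);
    }
}
```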
All this makes the CSV format somewhat difficult to use – it is very hard to guarantee that the data read back will be exactly equivalent (in type and value) to the data written. I gave up and switched to DataTable.WriteXml() and WriteXmlSchema(), but this produces much bigger files and is not compatible with anything but .NET. I am still looking for an alternative.
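Roughly, that round trip looks like the sketch below (the exact code may differ). Writing the schema file next to the data is what preserves the column types exactly, which is precisely what CSV was missing.

```csharp
// Sketch of the DataTable XML round trip. The DataTable must have a TableName
// set, otherwise WriteXml refuses to serialize it.
using System.Data;

static class XmlRoundTripSketch
{
    public static void Save(DataTable table, string dataPath, string schemaPath)
    {
        table.WriteXmlSchema(schemaPath);    // column names and types
        table.WriteXml(dataPath);            // the data itself
    }

    public static DataTable Load(string dataPath, string schemaPath)
    {
        var table = new DataTable();
        table.ReadXmlSchema(schemaPath);     // restore the types before loading data
        table.ReadXml(dataPath);
        return table;
    }
}
```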