Friday, April 04, 2008

4-state state machine for CSV parsing

Parsing a CSV file is easy: it's nothing but splitting a string on a comma delimiter, which is easily done in Java... That was the first thing that came to my mind when I was about to parse a CSV file in Java. The reality is that all of the following are possible valid lines in a CSV file:
  • 1,Bender
  • 2,"Bender"
  • 3,"Bender, Bending"
  • 4,"Ben""d""er"
  • 5, Ben"der
  • 6, Ben""der
Line 7 might be arguable, but anyway, the two basic rules are:
  • If there's a comma in a field, wrap the field in double quotes; otherwise the double-quote wrapper isn't required.
  • Inside double quotes, a double quote is escaped with another double quote.
Suddenly the problem is more complicated than string splitting. However, it can be reduced to a finite state machine with 4 states.
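To see where plain splitting goes wrong, here is a minimal sketch in Java. The class name `NaiveSplit` and its `split` helper are made up for illustration; the only library call is the standard `String.split`.

```java
import java.util.Arrays;
import java.util.List;

class NaiveSplit {
    // Naive approach: cut the line at every comma and ignore quoting entirely.
    // The -1 limit keeps trailing empty fields instead of dropping them.
    static List<String> split(String line) {
        return Arrays.asList(line.split(",", -1));
    }

    public static void main(String[] args) {
        // Two fields, as expected.
        System.out.println(split("1,Bender").size());
        // The comma inside the quotes is also treated as a delimiter,
        // so this two-field line comes back as three pieces.
        System.out.println(split("3,\"Bender, Bending\"").size());
    }
}
```

The quoted comma is cut like any other comma, and the surrounding double quotes stay in the output, so both rules above are violated.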

States:
  • 1. Ready for new field (initial state)
  • 2. Field without double quotes
  • 3. Field with double quotes
  • 4. Escaping or end of double quotes
Transitions:

*Direction*|*Condition*|*Action*
1->2 |not (" or ,) |Append character to buffer
1->3 |" |Nothing
2->2 |not , |Append character to buffer
1, 2 or 4->1 |, |Output completed field and create buffer for next field
3->3 |not " |Append character to buffer
3->4 |" |Nothing
4->3 |" |Append " to buffer (escaped double quote)
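The state machine could be sketched in Java roughly as follows. This is only one possible reading of the table, not a complete CSV parser. The names `CsvLineParser`, `State` and `parse` are made up for illustration. The sketch also includes the escape transition (a second double quote in state 4 returns to state 3 and appends one double quote), which follows from the escaping rule. Two behaviours are my own assumptions, since the table does not cover them: the end of the line closes the last field, and any other character seen in state 4 is kept and treated as the start of unquoted text.

```java
import java.util.ArrayList;
import java.util.List;

class CsvLineParser {
    // The four states: 1 = READY, 2 = UNQUOTED, 3 = QUOTED, 4 = QUOTE_SEEN.
    enum State { READY, UNQUOTED, QUOTED, QUOTE_SEEN }

    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        State state = State.READY;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case READY:
                    if (c == ',') {                    // 1->1: empty field
                        fields.add("");
                    } else if (c == '"') {             // 1->3: opening quote, nothing to store
                        state = State.QUOTED;
                    } else {                           // 1->2: first character of a plain field
                        buf.append(c);
                        state = State.UNQUOTED;
                    }
                    break;
                case UNQUOTED:
                    if (c == ',') {                    // 2->1: field complete
                        fields.add(buf.toString());
                        buf.setLength(0);
                        state = State.READY;
                    } else {                           // 2->2: quotes here are kept literally
                        buf.append(c);
                    }
                    break;
                case QUOTED:
                    if (c == '"') {                    // 3->4: escape or closing quote
                        state = State.QUOTE_SEEN;
                    } else {                           // 3->3: commas are ordinary characters here
                        buf.append(c);
                    }
                    break;
                case QUOTE_SEEN:
                    if (c == '"') {                    // 4->3: "" inside quotes is one literal "
                        buf.append('"');
                        state = State.QUOTED;
                    } else if (c == ',') {             // 4->1: the quote closed the field
                        fields.add(buf.toString());
                        buf.setLength(0);
                        state = State.READY;
                    } else {                           // assumption: lenient, keep the character
                        buf.append(c);
                        state = State.UNQUOTED;
                    }
                    break;
            }
        }
        // Assumption: end of line terminates the last field in every state.
        fields.add(buf.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("4,\"Ben\"\"d\"\"er\"")); // prints [4, Ben"d"er]
    }
}
```

With this reading, `3,"Bender, Bending"` yields the two fields `3` and `Bender, Bending`, while the unquoted `5, Ben"der` keeps its leading space and its double quote unchanged.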
