All Articles

Experience: Week Nine

mlpack-logo.png gsoc-logo.png

This week we mostly worked on optimizing the parsing related code and getting all test cases pass. We are also thinking about a bit of refactoring in code design but we will msotly handle that later as there are still more things to consider.

We came up with a way to parse categorical data and pass all test cases, there is some scope of improvement in the logic maybe but it works well for now and supports everything that mlpack use to. Logic is here

if (token[0] == '"' && token[token.size() - 1] != '"')
{
  std::string tok = token;

  while (token[token.size() - 1] != '"')
  {
    tok += delim;
    std::getline(lineStream, token, delim);
    tok += token;
  }

  token = tok;
}

Consider this example 1,2,"string, with comma",3

Basically as we are using getline(lineStream, token, ',') it will read the string in-between the commas. So as we will reach the 3rd column we will first have

  • token = "string" but this is not the complete token
  • save this token value in a string and append , at the end
  • repeat step 1 & 2 till we have " as the last char of the token

Also Conrad suggested a way to handle sparse matrix which is just a small addition in our csv parse for dense matrix. That helps a lot as now we can remove the Load() which we have specifically for sparse matrix.

There are some issues with removing the Load() for sp_mat as this logic only handles sparse matrix in csv format and not other formats like txt, bin and tsv.

We will handle it but first let’s concentrate on csv parse only.