This week we mostly worked on optimizing the parsing related code and getting all test cases pass. We are also thinking about a bit of refactoring in code design but we will msotly handle that later as there are still more things to consider.
We came up with a way to parse categorical data and pass all test cases, there is some scope of improvement in the logic maybe but it works well for now and supports everything that mlpack use to. Logic is here
if (token[0] == '"' && token[token.size() - 1] != '"')
{
std::string tok = token;
while (token[token.size() - 1] != '"')
{
tok += delim;
std::getline(lineStream, token, delim);
tok += token;
}
token = tok;
}
Consider this example
1,2,"string, with comma",3
Basically as we are using getline(lineStream, token, ',')
it will read
the string in-between the commas. So as we will reach the 3rd column we will
first have
token = "string"
but this is not the complete token- save this token value in a string and append
,
at the end - repeat step 1 & 2 till we have
"
as the last char of the token
Also Conrad suggested a way to handle sparse matrix
which is just a
small addition in our csv parse
for dense matrix. That helps a lot as
now we can remove the Load()
which we have specifically for sparse matrix.
There are some issues with removing the Load()
for sp_mat as this logic
only handles sparse matrix in csv
format and not other formats like txt
,
bin
and tsv
.
We will handle it but first let’s concentrate on csv parse
only.