Processing Field Specified Data



I am working on a project where data is in a csv file in the format:

 

"variableName", timeStamp, data

"anotherVariableName", timeStamp, data

etc..

 

This pattern repeats as new data comes in for about 10 different variables, so in a file with 40,000 rows there are about 4,000 rows with one variable name, about 4,000 with another, and so on.  Reading this data is easy:

 

      private handle = file.Open(filePath, 1, 0, 0, 1)
      private string datain
      file.Read(handle) // read and discard the header line
      datain = file.ReadDelim(handle, -1, ",", chr(10), 0, 1) // parse the rest of the file into a string array
      file.Close(handle)

I open the file, read the header (which I don't want), then use ReadDelim() to read the rest of the file and parse it into an array.  Since the first column is a string containing the variable names, I make sure to read the entire file as strings.  I can always convert the values back to numbers later.
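
Just to be explicit about the layout, the result is a two-dimensional string array: one row per line of the file and one column per field.  So, for the sample format above, indexing works roughly like this (whether the quotes around the variable name come through depends on how ReadDelim() handles them):

      ? datain[0][0] // the variable name from the first data row (possibly still quoted)
      ? datain[0][1] // the human readable timestamp for that row
      ? datain[0][2] // the data value, still a string at this point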

 

Now, here's the part I really wanted to share.  The brute-force way of processing the result would be to create a for loop that cycles through each row, looks at the variable name column, and uses a switch statement to split the data.  For example:

 

     for (private i = 0, i < numrows(datain), i++)
         private val = datain[i][2] // the data value for this row
         val.time = strToTime(datain[i][1], "mdyhms") // convert the human readable timestamp
         switch
            case (datain[i][0] == '"variableName"')
               variableName.addValue(val)
            case (datain[i][0] == '"anotherVariableName"')
               anotherVar.addValue(val)
         endcase
     endfor

In my case the timestamp was in human readable format, so I used strToTime() to convert it to seconds since 1970.
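
As a side note, strToTime() works on a whole array of strings just as well as on a single one, which is what makes the array approach below possible.  A rough sketch (the timestamp text here is made up for illustration):

      private t = strToTime("01/15/24 12:30:00", "mdyhms") // one timestamp converted to seconds since 1970
      // the same call also accepts an entire column, e.g. strToTime(datain[][1], "mdyhms")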

 

This was brutally slow.  The test file I was using had 40,000 rows and when I ran the script it took forever.  In fact, I gave up waiting after 10 minutes. 

 

Fortunately, there is a way to process this quickly.  In fact, the technique I'm about to share reduced the time to process the 40,000 row file to about 1 second, including sorting the result by time.  To do this I created a function:

 

function processFileVar(string datain, string fileVar, string dfVar)

   private string filtered
   private out

   // keep only the rows whose first column matches the requested variable name
   filtered = filter(datain, (datain[][0] == fileVar))
   if (numrows(filtered) == 0)
      ? "No data found for: " + fileVar
      return
   endif
   out = filtered[][2] // the data column
   out.time = strToTime(filtered[][1], "mdyhms") // convert the whole timestamp column in one call
   execute(dfVar + ".addValue(out)") // append the block of values to the named variable
   execute(dfVar + " = sorttime(" + dfVar + ")") // sort the result by time

It takes the string array from the file, the name of a field in the file, and the name of the DAQFactory variable I'd like to stick the result in.

 

The trick is to use the filter() function.  The filter() function takes an array plus a condition array and returns a (presumably smaller) array containing only the elements where the condition is one.  So, in my case, I simply filter on rows where the first column == fileVar.
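
If filter() is new to you, here is a tiny sketch with a plain numeric array (toy values, just to show the idea):

      private x = {3, 7, 1, 9, 4}
      private y = filter(x, x > 5) // y is {7, 9}: only the elements where the condition array is one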

 

The result is an array where all the data is for one particular variable.  Fortunately, most functions in DAQFactory work with arrays, so I was able to use both strToTime() and addValue() on the resulting array without having to use any sloooowww for loops.
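
After the filter, the array still has the same three columns, just fewer rows, so the whole-column indexing from before still applies:

      ? filtered[][0] // the variable name column, all equal to fileVar now
      ? filtered[][1] // just the timestamps for that variable
      ? filtered[][2] // just the data values for that variable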

 

So, in my main sequence, I simply replaced that slow loop with:

 

      processFileVar(datain, "variableName", "variableName")
      processFileVar(datain, "anotherVariableName", "anotherVar")

Hopefully this will help you think in arrays.  Since DAQFactory script is interpreted, which allows it to do things like be changed on the fly, it's kind of slow.  This especially shows up in loops.  However, any function in DAQFactory that works with arrays does all that processing in low-level C++, so it runs very, very fast.  So, while it's often good to use the old brute-force method of processing data when you are first figuring things out, try to optimize by thinking in arrays.

 

If anyone wants more examples, just post a loop you'd like to optimize...

 

 
