CH2014

Corrupt Persist File


Hi,

I have a DF application running perfectly on a Windows 10 laptop. There is a Persist directory that contains 104 DAT files (each file is set to hold 3144960 values), with a file size of approximately 47.9 MB each. The application was moved to the client's HMI (also Windows 10). After about a week of random use, the DF application started to crash on startup (including Daqfactory.exe itself). I was not able to gain access to the DF Command/Alert utility to see what the error might be. However, I eventually determined that there was some sort of corruption within the Persist directory: after deleting the Persist directory and then allowing DF to recreate it and the 104 DAT files, the application started working again.

Some questions:

1. What can cause the Persist file to become corrupted?

2. Is there any way to pinpoint the corrupted DAT file (as it seems a bit drastic deleting all the DAT files)?

3. Is it possible to repair the DAT file (as potentially quite a lot of data could be lost)?

 

 


DAQFactory uses a rolling buffer file for persist files.  That means that when a data point is added to the persist file, only three 8 byte chunks are actually written to disk.  16 bytes is the new data, and 8 bytes is used as a pointer to the most recent data.  While that is fine for the 16 bytes as that is always a new place on the disk, the 8 byte pointer is going to be the same place on the disk always.  Newer SSD drives should wear level and this should not cause a problem, however, it does mean that if this pointer is corrupted then things go to hell.  So:

1) probably some disk failure on your system.  The first thing I would do if it happens again is see if the files are all still the same size as reported by the OS.

2) not really, unless you want to try to read each of the headers and validate them.  You could write a script to do this for sure, but it is probably not worth it unless it happens again.

3) probably.  If the problem is the header, you could probably rewrite it with correct data.  For sure, you could write a script that loads from your log files and then recreates the persist data by adding the log data back into the channel.  Just remember, persist files aren't meant to be a replacement for logs.  Important data should always be logged.  Persist files are just for rapid, easy access to historical data.  You could also not use persist and instead access historical data from logs.  The easiest way to do this would be to log to a MySQL database and perform queries on the data as needed.  It won't perform quite as fast, but should be more than adequate if you use db.queryToClass().
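For the size check in point 1, a minimal sketch in Python (run outside DAQFactory) might look like the following. It assumes every persist file in the directory was configured with the same history length, so all files should report the same size; the function names and the `.dat` suffix filter are illustrative and not part of DAQFactory. As a rough sanity check on the numbers in this thread, 3144960 points at 16 bytes each is 50,319,360 bytes, or about 47.99 MiB, which is in line with the file size reported above (any header bytes come on top of that).

```python
import os
from collections import Counter

BYTES_PER_POINT = 16  # from the thread: 16 bytes of new data per point


def expected_data_bytes(history_length):
    """Data bytes for a given history length, excluding any header."""
    return history_length * BYTES_PER_POINT


def find_odd_sized_files(directory, suffix=".dat"):
    """Return (name, size) for files whose size differs from the most common size.

    Assumption: all persist files share one history length, so a file with
    a different size is worth a closer look. This does not prove corruption.
    """
    sizes = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name.lower().endswith(suffix):
            sizes[name] = os.path.getsize(path)
    if not sizes:
        return []
    common_size, _count = Counter(sizes.values()).most_common(1)[0]
    return [(name, size) for name, size in sizes.items() if size != common_size]
```

Calling `find_odd_sized_files()` with the path to the Persist directory returns an empty list when all DAT files match. A file that has the right size can still have a bad header, so this only narrows the search.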

 


Thanks for the quick reply and the useful information. There is a "log to file" also set up in the application; the persists are just used for the graphs. I will try out your suggestions in case the problem crops up again. Last question: if the HMI lost power while DF was processing a write to the persist file, could this cause file corruption?


Certainly.  A loss of power can cause all sorts of problems even outside DAQFactory.  I once lost power during Windows boot and had to reinstall the OS (this was a long time ago....)   These problems don't crop up often because systems typically are only writing to disk a very small percentage of operating time, but they do happen.   That is why you should always have a UPS in place with a connection to the PC so that Windows can gently shut things down if there is a sustained power loss.  A laptop achieves the same thing, but of course they tend not to be rugged.


On further checking, the drive is an SSD (64 GB). With ref. to your July 28 post above:

"Newer SSD drives should ???? wear level and this should not cause a problem, however, it does mean that if this pointer is corrupted then things go to hell"

Does this statement refer only to SSDs?

 

The DF app froze again earlier this week (Blank White Screen, Spinning Windows 10 Halo) after about 9 days of nonstop operation (Note: There was no power failure). App worked again after renaming Persist Dir to PersistOLD and recreating Persist Dir & Files. (Note: App has been working perfectly on a Windows 10 laptop)

I have asked the client to run chkdsk on the SSD. Just waiting for the results. 


It refers to any non-magnetic drive.  As I said, newer systems have better wear leveling.  Older style ones didn't and so if you rewrote to the same place on the disk over and over you could very quickly kill that part of the disk.  While newer ones work better, they are not completely immune.
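To illustrate why a rolling buffer file keeps rewriting one fixed location, here is a toy sketch in Python. The layout (an 8-byte index at offset 0, followed by fixed 16-byte records of two doubles) is an assumption made purely for illustration and is not DAQFactory's actual persist format; all names are hypothetical.

```python
import struct

POINTER = struct.Struct("<q")  # 8-byte index of the most recent record (assumed layout)
RECORD = struct.Struct("<dd")  # 16 bytes: timestamp and value as doubles (assumed)


def create(path, capacity):
    """Create an empty rolling buffer file with room for `capacity` records."""
    with open(path, "wb") as f:
        f.write(POINTER.pack(-1))              # -1 means "no data yet"
        f.write(bytes(RECORD.size * capacity))


def read_pointer(path):
    """Read the index of the most recent record from offset 0."""
    with open(path, "rb") as f:
        (index,) = POINTER.unpack(f.read(POINTER.size))
    return index


def append(path, capacity, timestamp, value):
    """Write one record into the next slot, then update the pointer."""
    slot = (read_pointer(path) + 1) % capacity
    with open(path, "r+b") as f:
        f.seek(POINTER.size + slot * RECORD.size)  # a new place for each record
        f.write(RECORD.pack(timestamp, value))
        f.seek(0)                                  # always the same place
        f.write(POINTER.pack(slot))


def pointer_is_plausible(path, capacity):
    """A basic header check: the pointer must fall inside the buffer."""
    try:
        return -1 <= read_pointer(path) < capacity
    except struct.error:  # file too short to hold a pointer at all
        return False
```

In this sketch every `append()` touches offset 0, which is the access pattern that wear leveling has to absorb, and a pointer outside the valid range makes the whole file unreadable even though the records themselves are intact.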

 


The client ran the error checking on the SSD and no errors were found, so something else is corrupting the Persist directory. As mentioned above, I've not had any problems with the application running on a Windows 10 laptop. Do you know of any Windows 10 issues that might cause the Persist files to become corrupted? Is it possible that a Windows 10 issue is causing DF to freeze or crash, which in turn damages the Persist files? Some info: DF is run as Administrator, and the Persist directory has "Permission for everyone" set to Full Control, Modify & Write. Could the sleep settings in Windows 10 Power Management cause a problem?


Does this happen often, or did it just happen once?

Do you have auto-updates disabled?  In Win10 it is kind of tricky to do.


It's happened twice in the last 3 weeks. Nothing has been done to the auto-update settings, as the HMI is not connected to the internet.


Any chance you could send me the .ctl document?  You can email to support @ if you don't want to post it.


Thanks for the reply. I might have access to the actual HMI soon so I will do some checks on the actual hardware and see if I can see what the problem might be. If I can't spot anything I will email the .ctl file thanks. Just for extra info, I have had the app working perfectly on an old  (2008) Asus Eee900 PC (Celeron M 353 with 2GB of Memory)  slightly sluggish on the screen actions but otherwise functionally fine. It's been running for 8 days without any problems.


 

Hi Again,

With ref. to the above problem: I now have access to the identical HMI hardware (op sys: Windows 10) and have been running the DF program for approx. 6 days. A few days ago I noticed that the DF app was operating very abnormally, i.e. all the DF sequences had significantly slowed down (e.g. comms polling, screen animations), and the Windows spinning blue circle was momentarily being displayed along with some screen freezing.

I checked Windows Task Manager and noticed a background process called "Antimalware Service Executable" reporting quite high CPU usage. "Antimalware Service Executable" is the process name for Microsoft's Windows Defender anti-virus program. After googling, I noted that Windows Defender can at times have very high CPU usage, causing software programs to slow down or crash.

The only ways to permanently disable Windows Defender real-time protection are to install a 3rd-party anti-virus program, or to either (A) edit the registry or (B) add the entire C:\ drive to the Exclusions option in the Windows Defender Security Center, and in addition switch OFF the Real-Time Protection switch in the Windows Defender Security Center. Note: doing all this is OK on this HMI system, as it's dedicated to the DF program (i.e. with no internet access).

Anyway, since disabling Windows Defender the DF app has been working smoothly and perfectly. I will, however, keep checking just to be sure.

 


Thanks for the update!  I'm not a big fan of 3rd party anti-virus though as it is often bloatware.  I usually use Microsoft Security Essentials, which may have been incorporated into Win 10 as Defender?

 


Hi,

Further to all the above posts, and in particular my last post: although the Windows Defender anti-virus did affect the performance of DF whilst doing its virus scan, it was not the cause of the lockups described in this thread. The lockups are still occurring, and what was not mentioned above is that they are in fact complete PC freeze-ups. Even the keyboard does not respond (confirmed by the CAPS LOCK LED not operating). I'm thinking that this is some sort of hardware failure, but it only seems to happen when running the DF application on this particular HMI hardware. I have used stress test software to test the GPU, CPU and RAM, plus a temp file write test. All tests pass. As mentioned above, the DF application works without problems on, for example, a Windows 10 laptop, the only main difference being that the laptop has a magnetic drive whereas the HMI has an SSD. Have you ever known a DF app to completely lock up a computer? I found out today that the same thing seems to be happening on an identical HMI with the same DF app running.

On rebooting the HMI, sometimes the Persist file directory is corrupted and needs to be recreated.

Just to recap: there is a Persist directory that contains 104 DAT files (each file is set to hold 3144960 values), with a file size of approximately 47.9 MB each. The #Hst is set to 10.


Further to the above. I have been in contact with the HMI hardware manufacturer. They said the only similar issue whereby another customer had a total freeze-up of the hardware was related to the "C State Report" setting in the BIOS. The solution was to disable the "C State Report" setting.

I have now disabled "C State Report" in the BIOS and am now checking to see if it has solved the problem.


There's nothing glaringly wrong in your document, though you should watch your indentation.  Though that won't affect the way DAQFactory runs, it does affect how readable your code is.

Also, your giant boolean expressions can be written much more simply.  You can use bit math to quickly tell if 4 bits are in a certain configuration: just & the value with 15 (0x0f) and compare.  So, if you want all 4 bits off, just do:
 
(S1 & 0x0f) == 0

For any bits 0-2 on and bit 3 off, do:

((S1 & 0x0f) < 8) && ((S1 & 0x0f) > 0)

I also would do a for loop and use evaluate() to check all 67 items if I stuck with S1-S67.

But if you had S1-S67 in an array called S with 67 elements [0] -> [66], it gets even simpler and you can pretty much do it in a single line:

All of the 4 LSBs off for the entire S array:
max(S & 0x0f) == 0

Any bits 0-2 on and bit 3 off:

min(((S & 0x0f) < 8) && ((S & 0x0f) > 0)) == 1
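The same mask logic can be tried outside DAQFactory. The Python sketch below is only an illustration: the function names are made up, and it assumes that a comparison on an array yields 1 or 0 per element, as the expressions above imply. Under that assumption, min(...) == 1 holds only when every element passes the test, whereas max(...) == 1 would hold as soon as at least one element passes.

```python
LOW_NIBBLE = 0x0F  # mask for the 4 least significant bits


def low_bits_all_off(value):
    """True when bits 0-3 are all off."""
    return (value & LOW_NIBBLE) == 0


def low3_on_and_bit3_off(value):
    """True when at least one of bits 0-2 is on and bit 3 is off."""
    nibble = value & LOW_NIBBLE
    return 0 < nibble < 8


def every_element(values, check):
    """Mirrors min(<0/1 array>) == 1; `values` must be non-empty."""
    return min(int(check(v)) for v in values) == 1


def any_element(values, check):
    """Mirrors max(<0/1 array>) == 1; `values` must be non-empty."""
    return max(int(check(v)) for v in values) == 1


def all_low_bits_off(values):
    """Mirrors max(S & 0x0f) == 0 for the whole array."""
    return max(v & LOW_NIBBLE for v in values) == 0
```

For example, with `S = [0x01, 0x13, 0x27]` every low nibble is between 1 and 7, so `every_element(S, low3_on_and_bit3_off)` is true. Changing one entry to `0x08` makes it false, while `any_element` stays true.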

 


Thanks for that. I try to keep on top of the indentation but it looks like I have missed some areas. I will look at converting to the bit math method. The app is still working fine (7 days)  with the BIOS "C-State Report" disabled. What are your thoughts on the Windows 10 control of C-State?


Hi,

Further to the above: it's been a while, but yes, definitely better with bit math, thanks. I managed to drastically reduce the code and also used a function for the repetitive in-between code.  With ref. to your reply above, i.e.

Any bits 0-2 on and bit 3 off:

min(((S & 0x0f) < 8) && ((S & 0x0f) > 0))

Is there a typo, and should it be max(((S & 0x0f) < 8) && ((S & 0x0f) > 0))?

Also,

An unrelated question to the topic: is it OK to have DF alarms set up with a condition that includes a non-existent channel name? I haven't noticed any errors but thought I'd ask. The reason for asking is that I want to avoid having to delete alarms when modifying the application for a smaller system, i.e. can I just leave the unused alarms in place?

 

 

 

 


Alarms with expressions that are invalid won't trigger, nor will they generate an alert, so you should be fine.

