Since a file system is a tree-like structure, the antivirus can start from a drive root (assuming a Windows system), treat every directory as a child node, and scan the children one by one in alphabetical order.
So a BFS scan should work.
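A minimal Python sketch of that traversal is below. The function name `bfs_files` is mine, and skipping unreadable directories and not following symbolic links are assumptions, not something a real antivirus necessarily does.

```python
import os
from collections import deque


def bfs_files(root):
    """Yield file paths under root breadth-first, alphabetically within each directory."""
    queue = deque([root])
    while queue:
        directory = queue.popleft()
        try:
            with os.scandir(directory) as it:
                entries = sorted(it, key=lambda e: e.name)
        except OSError:
            # Assumption: a directory we cannot read is simply skipped.
            continue
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                queue.append(entry.path)  # visit after the current level
            elif entry.is_file(follow_symlinks=False):
                yield entry.path
```

Because the order is deterministic for an unchanged tree, the position of the last scanned file is enough to describe how far the scan has progressed.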
When a file is scanned, the antivirus generates an MD5 hash of the file's content and matches it against its signature database. The lookup is O(log N) with binary search over a sorted list, or O(1) on average if the database is a hash table.
When the scan of a file completes, the antivirus records that file as the last completed node and moves on to the next one. That way, after a reboot it can resume after that file instead of starting from the beginning.
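As a rough sketch of the hash table variant, a Python `set` of hex digests can stand in for the signature database. The names `md5_of_file` and `is_known_malware` are hypothetical, and reading in 1 MiB chunks is just an assumption to avoid loading large files into memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_known_malware(path, signatures):
    """signatures is a set of MD5 hex digests; membership is O(1) on average."""
    return md5_of_file(path) in signatures
```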
Edge case scenarios:
Improper shutdown: if an improper shutdown happens while the last-scanned record is being updated, that file can be corrupted. To avoid this, we can use two files (recent_scanned and before_recent_scanned): first update before_recent_scanned by copying recent_scanned into it, then write the new value to recent_scanned. At any moment at least one of the two files is intact, so the scan position is not lost.
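Here is one way the two-file idea could look in Python. The file names come from the description above; everything else is an assumption for illustration, in particular the checksum line I prepend so that a half-written file can be detected on load, and the helper names `save_checkpoint` and `load_checkpoint`.

```python
import hashlib
import os
import shutil

RECENT = "recent_scanned"
BEFORE = "before_recent_scanned"


def _encode(path):
    # Assumption: first line is an MD5 of the path, used only to detect a torn write.
    checksum = hashlib.md5(path.encode("utf-8")).hexdigest()
    return f"{checksum}\n{path}"


def _decode(text):
    checksum, sep, path = text.partition("\n")
    if sep and hashlib.md5(path.encode("utf-8")).hexdigest() == checksum:
        return path
    return None  # truncated or corrupted content


def save_checkpoint(state_dir, scanned_path):
    recent = os.path.join(state_dir, RECENT)
    before = os.path.join(state_dir, BEFORE)
    if os.path.exists(recent):
        shutil.copyfile(recent, before)  # step 1: keep the previous value
    with open(recent, "w", encoding="utf-8") as f:
        f.write(_encode(scanned_path))  # step 2: write the new value
        f.flush()
        os.fsync(f.fileno())


def load_checkpoint(state_dir):
    """Return the newest intact checkpoint, or None if there is none."""
    for name in (RECENT, BEFORE):
        try:
            with open(os.path.join(state_dir, name), encoding="utf-8") as f:
                path = _decode(f.read())
        except (OSError, UnicodeDecodeError):
            continue
        if path is not None:
            return path
    return None
```

If the crash happens during the copy, recent_scanned is still intact. If it happens during the write, before_recent_scanned holds the previous position, so at worst one file gets scanned again.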
Some files are copied: while a scan is running, the user can copy files from one directory to another. Two scenarios can happen: either the user copies something from an already scanned location to another location, or copies non-scanned files into an already scanned location. We can ignore the first case (although, for better optimization, we can consider it as well). For the second case, we can make a list of those files and save it to scan later, or we can scan each file before it is written. All of this requires using the system API.
New file written by another program: if another program writes a new file, the system API can also tell us which files were written, so we can scan them.
Optimization: to speed things up we can use multithreading. Each worker thread takes a file from the current pointer, saves the file name to disk (in a file named currently_scanning.txt), and starts scanning. When it finishes, it removes that file name from disk and takes the next non-scanned file from the pointer. While taking a new file, a thread must lock the current pointer file so that another thread does not scan the same file. For a better user experience, the program can check processor idle status, free RAM, etc., and adjust the total thread count if necessary.
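The worker loop could be sketched in Python roughly like this. To keep it short I made some assumptions: an in-memory set stands in for currently_scanning.txt, a `threading.Lock` stands in for locking the pointer file, the thread count is fixed, and `scan_file` is a hypothetical callback that returns the verdict for one path.

```python
import threading


def scan_in_parallel(paths, scan_file, num_workers=4):
    """Scan every path exactly once using num_workers threads."""
    pointer = iter(paths)      # the shared "current pointer"
    lock = threading.Lock()    # guards the pointer and the bookkeeping
    in_progress = set()        # stand-in for currently_scanning.txt
    results = {}

    def worker():
        while True:
            with lock:
                path = next(pointer, None)
                if path is None:
                    return     # nothing left to scan
                in_progress.add(path)
            verdict = scan_file(path)  # slow work happens outside the lock
            with lock:
                in_progress.discard(path)
                results[path] = verdict

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Only taking the next file and updating the bookkeeping happen under the lock, so threads never pick the same file while the actual scanning still runs concurrently.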
Virus database updated: the virus database can be updated frequently, since new viruses are released very often. If a complete scan has already happened, the antivirus can save the MD5 hashes of the scanned files in a file and cross-check them against the updated virus hashes.
These are the ideas that came to my mind. Let me know if there's a better way to solve this.
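A small Python sketch of that cross-check, assuming the saved hashes are loaded into a dict of path to MD5 hex digest and the update arrives as a set of new digests (both shapes, and the name `recheck_after_update`, are my assumptions):

```python
def recheck_after_update(scanned_hashes, new_signatures):
    """Return paths whose saved hash matches a newly added signature.

    scanned_hashes: dict mapping file path -> MD5 hex digest from the last full scan
    new_signatures: set of MD5 hex digests added by the database update
    """
    return sorted(
        path
        for path, digest in scanned_hashes.items()
        if digest in new_signatures
    )


saved = {"C:/a.exe": "aaa", "C:/b.dll": "bbb", "C:/c.txt": "ccc"}
print(recheck_after_update(saved, {"bbb", "zzz"}))  # prints ['C:/b.dll']
```

This only works for files that have not changed since their hash was saved. A file modified after the full scan would need to be hashed again.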