Important things

  • 302 HTTP response with a Location header for URL redirection (GET and HEAD); 307 for temporary redirection (the request method is preserved)
  • Spring Sleuth - tracing in microservices
  • https://astikanand.github.io/techblogs/high-level-system-design/design-bookmyshow
  • https://www.hellointerview.com/learn/system-design/in-a-hurry/introduction

Wednesday, 19 May 2021

find-duplicate-file-in-system

 

Question-1:

Imagine you are given a real file system, how will you search files? DFS or BFS?

Answer:

core idea: DFS

Reason: as long as the directory tree is not too deep, DFS is suitable and simpler to implement than BFS.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.math.BigInteger;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class Directory {
	List<Directory> subDirectories;
	List<File> files;
}

// Read the whole file into memory and hash it in one shot (fast, but memory hungry).
public static String makeHashQuick(File file) {
	try {
		FileInputStream fileInput = new FileInputStream(file);
		byte[] data = new byte[(int) file.length()];
		fileInput.read(data);
		fileInput.close();

		MessageDigest md = MessageDigest.getInstance("MD5");
		return new BigInteger(1, md.digest(data)).toString(16);
	} catch (IOException | NoSuchAlgorithmException e) {
		throw new RuntimeException("can't read file: " + file.getAbsolutePath(), e);
	}
}

// Group file paths by the MD5 hash of their content.
public static void findDuplicatedFilesByMD5FromFile(Map<String, List<String>> lists, File file) {
	String fileHash = makeHashQuick(file);
	List<String> list = lists.get(fileHash);
	if (list == null) {
		list = new LinkedList<String>();
		lists.put(fileHash, list);
	}
	list.add(file.getAbsolutePath());
}

public static void findDuplicatedFilesByMD5(Map<String, List<String>> lists, Directory dir) {
	for (File file: dir.files) {
		findDuplicatedFilesByMD5FromFile(lists, file);
	}
	for (Directory curDir: dir.subDirectories) {
		findDuplicatedFilesByMD5(lists, curDir);
	}
}

public static List<List<String>> storeDuplicateFiles(Directory dir) {
	List<List<String>> ret = new ArrayList<List<String>>();
	Map<String, List<String>> listsByMD5 = new HashMap<String, List<String>>();
	findDuplicatedFilesByMD5(listsByMD5, dir);
	for (List<String> list : listsByMD5.values()) {
		if (list.size() > 1) {
			ret.add(list);
		}
	}
	return ret;
}

Question-2:

If the file content is very large (GB level), how will you modify your solution?

Answer:

core idea: make use of metadata, such as the file size, before actually reading the large file content.

Two steps:

  • DFS to map each file size to the list of paths that have that size: Map<Long, List<String>> (sizes are long values).
  • For each size with more than one file, compute the MD5 hash of every such file; files with the same size and the same hash are treated as identical: Map<String, List<String>>, mapping each hash to the list of file paths. The hash values are very large numbers, so we use Java's BigInteger to convert the digest into a hex string.
// Group file paths by file size (read from metadata only; no content is read).
public static void findDuplicatedFilesBySizeFromFile(Map<Long, List<String>> lists, File file) {
	try {
		Path filePath = Paths.get(file.getAbsolutePath());
		BasicFileAttributes attr = Files.readAttributes(filePath, BasicFileAttributes.class);
		long size = attr.size();
		List<String> list = lists.get(size);
		if (list == null) {
			list = new LinkedList<String>();
			lists.put(size, list);
		}
		list.add(file.getAbsolutePath());
	} catch (IOException e) {
		throw new RuntimeException("can't read file attributes: " + file.getAbsolutePath(), e);
	}
}

public static void findDuplicatedFilesBySize(Map<Long, List<String>> lists, Directory dir) {
	for (File file : dir.files) {
		findDuplicatedFilesBySizeFromFile(lists, file);
	}
	for (Directory curDir : dir.subDirectories) {
		findDuplicatedFilesBySize(lists, curDir);
	}
}

public static List<List<String>> storeDuplicateFiles(Directory dir) {
	List<List<String>> ret = new ArrayList<List<String>>();
	Map<Long, List<String>> listsBySize = new HashMap<Long, List<String>>();
	findDuplicatedFilesBySize(listsBySize, dir);
	// Only hash files whose size matches at least one other file.
	Map<String, List<String>> listsByMD5 = new HashMap<String, List<String>>();
	for (List<String> list : listsBySize.values()) {
		if (list.size() > 1) {
			for (String fileName : list) {
				findDuplicatedFilesByMD5FromFile(listsByMD5, new File(fileName));
			}
		}
	}
	for (List<String> list : listsByMD5.values()) {
		if (list.size() > 1) {
			ret.add(list);
		}
	}
	return ret;
}

To optimize step 2: GFS stores a large file as multiple "chunks" (a chunk is 64 MB) and keeps metadata for each file, including the file size, file name, and the index of its chunks, along with each chunk's checksum (e.g. an XOR over the chunk's content). For step 2 we can then just compare the files' chunk checksums instead of reading the content again.
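A rough sketch of that idea (the chunk size, the XOR-style checksum, and the class/method names are assumptions here, not the GFS implementation):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class ChunkChecksums {
	// Hypothetical chunk size for illustration; GFS-style systems use much larger chunks.
	private static final int CHUNK_SIZE = 64 * 1024;

	// Compute one XOR checksum per chunk; two files are duplicate candidates
	// only if their checksum lists are equal.
	public static List<Long> checksums(String path) throws IOException {
		List<Long> result = new ArrayList<Long>();
		RandomAccessFile file = new RandomAccessFile(path, "r");
		byte[] buffer = new byte[CHUNK_SIZE];
		int read;
		while ((read = file.read(buffer)) > 0) {
			long checksum = 0;
			for (int i = 0; i < read; i++) {
				checksum ^= (buffer[i] & 0xFFL) << (8 * (i % 8));
			}
			result.add(checksum);
		}
		file.close();
		return result;
	}
}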

Disadvantage: there might be false-positive duplicates, because two different files can share the same checksum.

Question-3:

If you can only read the file by 1kb each time, how will you modify your solution?

Answer:

  • The makeHashQuick function is fast but memory-hungry; it may need to run with a larger heap (e.g. java -Xmx2G), or an even bigger heap if more RAM is available.
  • We may need to tune the buffer size ("buffSize") to keep memory usage efficient.
import java.io.RandomAccessFile;

// Hash the file in fixed-size chunks so only buffSize bytes are held in memory at a time.
public static String makeHashLean(File infile) throws IOException, NoSuchAlgorithmException {
	MessageDigest md = MessageDigest.getInstance("MD5");
	RandomAccessFile file = new RandomAccessFile(infile, "r");
	int buffSize = 1024;
	byte[] buffer = new byte[buffSize];
	// calculate the hash of the whole file, 1 KB at a time
	long offset = file.length();
	long read = 0;
	int unitsize;
	while (read < offset) {
		unitsize = (int) (((offset - read) >= buffSize) ? buffSize : (offset - read));
		file.read(buffer, 0, unitsize);
		md.update(buffer, 0, unitsize);
		read += unitsize;
	}
	file.close();
	return new BigInteger(1, md.digest()).toString(16);
}

Question-4:

What is the time complexity of your modified solution? What is the most time-consuming part and memory consuming part of it? How to optimize?

Answer:

  • The hashing step is the most time-consuming and memory-consuming part, since every candidate file's content has to be read and digested.
  • The optimizations above (grouping by size first, or comparing chunk checksums) reduce this cost, but checksum-only comparison introduces a false-positive risk.

Question-5:

How to make sure the duplicated files you find are not false positive?

Answer:

The size-first filtering from Question 2 narrows the candidates, and to be certain a match is not a false positive, files whose hashes collide can be compared byte by byte.
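A minimal sketch of that final byte-by-byte check (assuming Java 12+ for Files.mismatch; the confirmIdentical name is made up here):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DuplicateCheck {
	// Returns true only if the two files contain exactly the same bytes.
	public static boolean confirmIdentical(String pathA, String pathB) throws IOException {
		return Files.mismatch(Paths.get(pathA), Paths.get(pathB)) == -1L;
	}
}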

Tuesday, 11 May 2021

concurrent number of users - System Design

There are a few questions I would ask here.

  1. Are the full 1M users trying to visit the website from a single region?
  2. Are there any other backend calls that need to be made, or does the website only handle counting the number of people visiting?
  3. When we say "concurrent", do they want to know the count per second or per minute?

Approach:
If 1M requests arrive roughly uniformly over a day, that is about 1,000,000 / (24 × 60) ≈ 700 requests per minute, so we can design a backend that scales horizontally with traffic (assuming a good amount of resources is available to scale), with a load balancer distributing traffic from the different regions.
It is better to define two APIs: one that increments the count as soon as it is hit, and another that is continuously called to read the current count.
We can store the count in Redis (one write in a central region). A naive read-modify-write needs a lock, because with parallel writes Redis keeps only the last write; alternatively the increment can be made atomic on the server side (see the sketch below). We replicate the Redis cache to all regions to make reads faster and reduce load on the central instance.
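A minimal sketch, assuming the Jedis client and a hypothetical visits:today key and host name: Redis's INCR is atomic, so each request can bump the counter without an explicit lock, and the read API simply GETs the value.

import redis.clients.jedis.Jedis;

public class VisitCounter {
	private final Jedis jedis = new Jedis("redis.central.example", 6379); // assumed host

	// Called from the "hit" API: atomic server-side increment, no lock needed.
	public long recordVisit() {
		return jedis.incr("visits:today");
	}

	// Called from the "count" API (reads can also be served from regional replicas).
	public long currentCount() {
		String value = jedis.get("visits:today");
		return value == null ? 0L : Long.parseLong(value);
	}
}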

It gets a lot trickier if all the requests come from the same region; in the worst case, all 1M requests could arrive within one second and nothing else for the rest of the day, which limits how much replication or scaling can help. Here we might have to get into the low-level design and write the code optimally: check which protocol reduces latency (e.g. gRPC vs REST), and consider a queueing system so that we only let through x requests at a time instead of crashing the system with 1M requests at once.


Anti Virus System Design

Since the file system is a tree-like structure, the antivirus can start scanning from a drive (assuming a Windows system), treat all directories as child nodes, and scan them one by one, alphabetically.

So a BFS scan should work.
When a file is scanned, the scanner generates the MD5 hash of the file's content and matches it against the existing virus database. The lookup is O(log N) if the database is binary searched, or O(1) if it is a hash table.
When the scan of a file is completed, the scanner updates its "last completed" pointer to this file and moves on to the next one, so that after a reboot it can resume right after that file instead of starting from the beginning.
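A minimal sketch of that BFS traversal plus MD5 lookup (the virusSignatures set, the resume-pointer handling, and the class name are assumptions):

import java.io.File;
import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;
import java.util.Set;

public class BfsScanner {

	// Breadth-first traversal: scan every file in a directory, then descend into subdirectories.
	public static void scan(File root, Set<String> virusSignatures) throws IOException, NoSuchAlgorithmException {
		Queue<File> queue = new ArrayDeque<File>();
		queue.add(root);
		while (!queue.isEmpty()) {
			File dir = queue.poll();
			File[] entries = dir.listFiles();
			if (entries == null) continue;   // skip unreadable directories
			Arrays.sort(entries);            // alphabetical order, as described above
			for (File entry : entries) {
				if (entry.isDirectory()) {
					queue.add(entry);
				} else if (virusSignatures.contains(md5(entry))) {
					System.out.println("Infected: " + entry.getAbsolutePath());
				}
				// here: persist the "last completed" pointer so a reboot can resume
			}
		}
	}

	private static String md5(File file) throws IOException, NoSuchAlgorithmException {
		MessageDigest md = MessageDigest.getInstance("MD5");
		return new BigInteger(1, md.digest(Files.readAllBytes(file.toPath()))).toString(16);
	}
}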

Edge case scenario:

  1. Improper shutdown: if an improper shutdown happens while the "last scanned" file is being updated, that file can be corrupted. To avoid that, we can use two files (recent_scanned, before_recent_scanned): first update before_recent_scanned by copying recent_scanned into it, and only then write the new entry into recent_scanned. This way at least one of the two files is always intact (see the sketch after this list).

  2. Some files are copied: while the scan is running, the user can copy files from one directory to another. Two scenarios can happen: either already-scanned files are copied to another location, or non-scanned files are copied into an already-scanned location. We can ignore the first case (though handling it too would be a further optimization). For the second case, we can keep a list of those files and scan them later, or scan each file before it is written. All of this requires the system API.

  3. New files written by other programs: if a new file is written by some other program, the system API can also tell us which files were written, so we can scan them.

  4. Optimization: to speed things up we can use multithreading. Each worker thread takes a file from the current pointer, saves the file name to disk (in a file named, say, currently_scanning.txt), scans it, removes the file name from disk when finished, and then takes the next non-scanned file from the pointer. While taking a new file, a thread must lock the current-pointer file to avoid duplicate scanning by another thread. For a better user experience, the scanner can check processor idle status, free RAM, etc. and adjust the total thread count accordingly.

  5. Virus database updated: the virus database is updated frequently, because new viruses are released very often. If a complete scan has already happened, the antivirus program can save the MD5 hashes it computed and cross-check them against the updated virus hashes.
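A minimal sketch of the two-file checkpoint from item 1 (the file names and helper are assumptions):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class ScanCheckpoint {
	private static final Path RECENT = Paths.get("recent_scanned");
	private static final Path BACKUP = Paths.get("before_recent_scanned");

	// Copy the current checkpoint aside before overwriting it, so a crash
	// mid-write never leaves both copies corrupted.
	public static void recordScanned(String lastScannedPath) throws IOException {
		if (Files.exists(RECENT)) {
			Files.copy(RECENT, BACKUP, StandardCopyOption.REPLACE_EXISTING);
		}
		Files.write(RECENT, lastScannedPath.getBytes(StandardCharsets.UTF_8));
	}
}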

These are the ideas that came into my mind. Let me know if there's a better way to resolve this.

Sunday, 11 April 2021

Amazon Leadership principles

 I had done some special preparation for the LPs this time. I had prepared a document of around 15 LPs and added two scenarios for each using the STAR technique. I went through them once before every interview. This helped me a lot. All the LPs asked in the interviews were from this document. In the interview, I was able to shorten the time to discuss LPs because I didn't have to think about a scenario. I already knew it.

Customer Obsession
1.) When you asked for customer feedback: how did you use that feedback to drive innovation, and how did the customer respond?
2.) When you evaluated the customer experience of the product: what did you do and what was the result?

Ownership
3.) When you took on something significant outside your area of responsibility: why?
4.) When you did not think you were going to meet a deadline you promised: how did you identify the risk and convey it to stakeholders?

Deliver Results
5.) When you were able to deliver an important project under a tight deadline: what sacrifices did you make to do so, and how did they impact the outcome?

Earn Trust
6.) When you received critical feedback: what was it and what did you do?

Learn and be curious
7.) When you realized you needed deeper subject-matter expertise to do your job well: what was the outcome?
8.) When you did not know what to do next or how to solve a challenging problem: how did you learn, what options did you consider, what was the best way forward, and what was the outcome?

Insist on the Highest Standards
9.) When you refused to compromise your standards around quality and customer service: who was the customer?
10.) When you worked to improve the quality of a product or service that was already getting good customer feedback: why did you do that and how did the customer react?

Dive Deep
11.) A job that required you to dig deep to get to the root cause: how did you know you were focusing on the right things and what was the outcome?

Other General LPs
12.) Describe a time when you took a calculated risk.
13.) Describe a time when you took a shortcut.
14.) Conflict with a manager in any project.
15.) Three things you are doing to improve yourself.
16.) Tell about a time when you convinced someone of your opinion and you were right.
17.) What is the most innovative work you have done?

Here is an example for a LP scenario from my document. This is a real example from experience in my career.

LP scenario - When you realized you needed deeper subject-matter expertise to do your job well, and what was the outcome.
Situation - I had to take full responsibility for managing the Azure cloud resources for the project.
Task - The task was to be able to configure, create, and manage Azure resources.
Action - Studied Azure fundamentals, obtained the Azure Fundamentals certification, and then went deep into each resource being used, one by one. It took three weeks to master.
Result - Now I have full confidence in Azure, and I can design any such project seamlessly.

Saturday, 2 January 2021

Chaos Monkey NetFlix and Monkey testing

Chaos Monkey is used to test the stability of a production system by randomly terminating instances in production.

It verifies that the production system is resilient to instance failures.



Monkey Testing:

Randomly applying inputs to a system and verifying its behavior.

Saturday, 27 June 2020

Count the number of Set bits in an Integer/decimal number | Java | interview question





public class SetBits {

	public static void main(String[] args) {
		// Count the number of set bits (1s) in the binary
		// representation of a decimal number.
		int num = 5;      // 8-bit representation of 5: 00000101
		int count = 0;

		while (num > 0) {
			if ((num & 1) == 1) {   // check the lowest bit
				count++;
			}
			num = num >> 1;         // drop the lowest bit
		}

		System.out.println(count);
	}
}


Output : 2
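For comparison, the JDK also provides Integer.bitCount, which returns the same result without a manual loop:

public class SetBitsBuiltin {
	public static void main(String[] args) {
		System.out.println(Integer.bitCount(5));   // prints 2
	}
}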

Youtube video link:
https://youtu.be/Jh9i1OLLfks