URL Shortener aka TinyURL System Requirements
- Generate a unique tiny URL
URL Shortener aka TinyURL System Components
- createTiny(longURL) -> tinyURL (API 1)
- getLong(tinyURL) -> longURL (API 2)
- Client talks to service using REST endpoint
- Loadbalancer (front end for service) handles the request
- Worker host (W)
- API 1
- Takes URL
- Generate (random) tiny URL
- Stores both URL in persistence layer (Caching + DB)
- API 2
- Get Longer URL for shorter URL and returns it to the client
- API 1
- Caching Layer
- Memcache or Redis
- Database Schema
- Key => Tiny URL
- Value => longURL
How to generate Tiny URL
- Suppose we have a total of 62 characters, viz
- a-z = 26
- A-Z = 26
- 0-9 = 10
- If my tiny URL is 7 characters long. I can have 62^7 combinations ~= 3.5 trillion
- If we are generating 1000 tiny urls/sec it will take 110 years to exhaust this combination
- Any number from 0 to 3.5 trillion can be represented by 43 bits
How to convert 43 bits to tinyURL
- Convert binary 43 bits into decimal. Suppose that number is 1,324,231
- Convert (above) decimal number into base 62. Suppose that number is 60,25,27
- Map above numbers to characters, e.g. (a-z => 0-25) (A-Z => 26-51) (0-9 => 51-61). Above number will map to 8zB
Techniques to store Tiny URL
- Checks whether generated tinyURL is present in DB (doing get(tiny) on DB).
- If tinyURL isn’t present in DB then put longURL and tinyURL in DB (put(tinyURL, longURL)).
- Problem: This technique creates race condition because it’s possible that two threads may be simultaneously adding same data to DB and may end up corrupting data.
- Adding tinyURL and longURL in DB, if there is no key whose value is equal to tinyURL i.e. putIfAbsent(tinyURL, longURL)
- Requires support from DB (May or may not present in NoSQLDB)
Technique 3 (MD5 Approach)
- Calculate MD5 of longURL
- Take first 43 bits and use that to generate tinyURL
- Now check the DB (using technique 2 discussed above) to verify that tinyURL is not used already
- Advantages of MD5 approach: For two users using same long URL,
- In random generation, we generate two random tinyURL, so 2 rows in DB
- In MD5 technique, MD5 of longURL will be same for both the users and hence same first 43 bits of URL will be same and hence deduping will save some space as we will only create one row, saving space
Counter based approach
Single host approach
- Single host is responsible for maintaining a counter e.g. database, Zookeeper instance
- When the worker host receives request they talk to counter, which returns a unique number and increments the counter.
- Every worker host gets a unique number which is used to generate tinyURL.
- Single point of failure
- Single point of bottleneck
All host approach
- Every host tries to maintain the counter
- Suppose there are 64 worker hosts and we use unique 6 (bits used to represet number from 0-63) bits for each of them + 32 bits (of current timestamp) + 5 bits (random/incremental value)
- Collision probability is high (for 20 rps on each host randomly generated last 5 bits will lead to high collision rate + for rps > 32 (5 bits represent 32) we will have sure collisions)
- Adding new hosts will be problematic (6 bits representation will not work)
Range based approach
- We take 1st billion combinations from 3.5 trillion combination
- We have divide the 1st billion into 1000 ranges of 1 million each which is maintained by Zookeeper
- 1 -> 1,000,000 (range 1)
- 1,000,000 -> 2,000,000 (range 1)
- …. (range n)
- 999,000,000 -> 1,000,000,000 ((range 1000))
- Worker thread will come to Zookeeper and ask for unused range. suppose W1 is assigned range 1, W1 will keep incrementing the counter and generate the unique number and generate tinyURL based on that
- When a worker thread exhausts their range they will come to Zookeeper and take a fresh range.
- Guaranteed No Collition
- Addition of new worker threads is easy
- Worse case is we will lose a million combinations (if a worker dies) but that’s not severe as we have 3.5 trillion combinations.
- When 1st billion is exhausted, we will take 2nd billion and generate ranges to work with.
- Generating URL in sequential manner can be a sequential threat.
- To mitigate that we add random bits at the end of 43 bits (which will increase the length of tinyURL)
- Create Request will go to worker thread through the load balancer
- Worker host will check whether its (Range based) counter is exhausted or not
- If exhausted, it will take a fresh range from Zookeeper
- Worker thread will take a new number (from the range), generate tinyURL and stores tinyURL and longURL in DB and in distributed cache.
- We have placed tinyURL in cache because initially tinyURL will be accessed a lot.
- Get Request will go to worker thread through the load balancer
- Worker thread will first check the cache. If not found, it will return data from DB.