URL Shortener aka TinyURL is a very common problem of system design in interview questions. Let’s begin by defining the
Design a tiny URL service with following requirements.
Let’s try to solve this problem by building a Webservice exposing RESTFul endpoints. This service is responsible for generating tinyURL for longURL and returning tinyURL in response. Behind the scene, it uses a Database to store
tinyURL -> longURL mapping.
We can divide our system into following two major components, viz
This component consists of Worker hosts (W) behind a
Loadbalancer running web service exposing following RESTFul API to Client,
tinyURLpresent in URI
This is the Database layer which persists the mapping of TinyURL and LongURL. Database Schema can have columns of TinyURL, LongURL and a unique identifier.
We have also used a catching layer (Memcache or Redis) to improve the response time.
That’s a decent Architecture, but the crux of this problem is
Suppose a tinyURL is 7 characters long and each character can be taken from following 62 alphabets
We can generate 62^7 ( ~= 3.5 trillion) tinyURLs from 62 alphabets having 7 characters length. Also, if we are generating 1000 tiny URLs/sec, it will take 110 years to exhaust this combination which is a decent time to provide the service.
On a side note, a number from 0 to 3.5 trillion can be represented by 43 bits. Steps to convert 43 bits to tinyURL are as follows,
1,324,231will map to
Moving on, Let’s look into
tinyURL, is present in DB
Solution to Problem 1
Add (tinyURL, longURL) combination in database only when there is no key whose value is equal to tinyURL i.e. putIfAbsent(tinyURL, longURL). This approach requires support from DB (May or may not present in NoSQLDB)
Solution to Problem 2
We calculate MD5 of longURL and take first 43 bits to generate tinyURL. MD5 of longURL will be same for duplicate URLs. Hence first 43 bits of duplicate URLs will be the same generating common tinyURL.
putIfAbsent feature (discussed as a solution to Problem 1) can be hard to find in a Database. So let’s look into an alternate technique to prevent the race condition.
Technique 2: Counter based approach
In this approach, we use a unique number to generate tinyURL. This unique number is fetched from a single independent service as a counter variable which is incremented each time a unique number is requested.
When a worker node receives a request to create tinyURL, it fetches a unique number from counter service and uses it to generate tinyURL. As different hosts have a different unique number, they will generate different tinyURLs avoiding the race condition.
Problem: This approach creates single point of failure/bottleneck.
Solution 1: Every host maintaining the counter
Suppose there are 64 worker hosts and we know that we need 43 bits for a unique tinyURL.
We can calculate 43 bits as follows
This solution has its problems, viz
Solution 2: Range based approach
In this approach, we take 1st billion combinations from 3.5 trillion combination and divide the 1st billion into 1000 ranges of 1 million each which is maintained by Zookeeper
Worker thread will come to Zookeeper and ask for an available range. Suppose W1 is assigned range 1, W1 will keep incrementing the counter and generate the unique number and generate tinyURL based on that
When a worker thread exhausts their range, they will come to Zookeeper and take a new range. When 1st billion is exhausted, we will take 2nd billion and generate ranges to work with.
This approach has the following benefits
Challenges with this approach as follows
** High-Level Architecture of Range based approach**
tinyURLrequest comes to worker thread through the load balancer