Credit goes to urmaul.com

Generating Short URLs One Character at a Time

I came up with a non-bad approach for making URL shorteners.

If you’re familiar with the URL shortener problem, you can jump straight to the Random Characters approach.

Problem

You need to make a URL shortener service: something that takes a URL like https://www.youtube.com/watch?v=dQw4w9WgXcQ, creates a short slug for it like ab42, and then redirects you to the original URL when you visit example.com/ab42.

In real life I’d suggest you avoid URL shorteners you don’t own, but building one is also a popular technical interview task.

Solving this problem requires you to balance trade-offs: on the one hand, you want slugs to be as short as possible, and on the other hand, you want to make sure that each generated slug is unique.

Random String Approach

A straightforward approach would be to create random slugs of a specific size. If you get a collision when trying to save a slug to the database, try again.

Characteristics of this approach:

  • Low requirements for the database: any key-value store with a collision check would suffice.
  • You need to decide on a meaningful slug size, immediately giving up the “as short as possible” feature.
  • You can have many collisions per successful insert when the database gets filled, ending up in an infinite collision loop once all the slugs of the desired size are used.
  • That’s why it requires regular maintenance. Someone needs to check the collision metrics and adjust the slug size manually.
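
A minimal Python sketch of this approach, assuming a 62-character alphabet and a plain dict standing in for the key-value store. The names `random_slug`, `shorten_fixed`, and `max_attempts` are made up for illustration, and a real database would need an atomic insert-if-absent operation instead of the check-then-write shown here.

```python
import secrets
import string

# Assumed alphabet: a-z, A-Z, 0-9 (62 characters).
ALPHABET = string.ascii_letters + string.digits


def random_slug(size: int) -> str:
    """Return a random slug of exactly `size` characters."""
    return "".join(secrets.choice(ALPHABET) for _ in range(size))


def shorten_fixed(store: dict, url: str, size: int = 4, max_attempts: int = 100) -> str:
    """Save `url` under a fresh fixed-size slug, retrying on collision."""
    for _ in range(max_attempts):
        slug = random_slug(size)
        # Stand-in for the database's collision check.
        if slug not in store:
            store[slug] = url
            return slug
    # Without this cap, a full keyspace would loop forever.
    raise RuntimeError("too many collisions; the slug size is probably too small")
```

The `max_attempts` cap is the code-level symptom of the maintenance problem: once it starts firing, someone has to raise `size` by hand.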

I don't like it. The unpredictable number of collisions and the need for maintenance immediately make it tech debt.

Incremental Counter Approach

If these collisions seem scary, you can try to avoid them completely by removing any randomness from the algorithm. Instead, have an incremental counter increasing on every insert, and shorten the counter value with a number-to-text encoding similar to base62.

Characteristics of this approach:

  • We start with “as short as possible” slugs and grow them organically when the database size increases.
  • Higher requirements for the database because of the counter, especially if you’re going to use incremental IDs for that.
  • No manual maintenance required. Slug sizes grow over time automatically.
  • Counter values are not slugs, so you either need to have a two-step insert process (get the next counter value + encode the slug), or make a text-to-number decoder in the redirection endpoint.
  • A single counter means we cannot insert two URLs in parallel.
  • We might still have collisions with a two-step insert process when two URLs are inserted at the same time.
  • Depending on the number-to-text encoder, you might skip zero-padded slugs like “0001”.
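
As a rough illustration of the encoding step, here is a Python sketch of a base62 encoder and decoder. The alphabet order (digits, then lowercase, then uppercase) and the function names are assumptions, not part of any standard. The counter itself is left out, since how you get the next value depends on the database.

```python
import string

# Assumed alphabet order: 0-9, a-z, A-Z.
ALPHABET = string.digits + string.ascii_letters
BASE = len(ALPHABET)  # 62


def encode_base62(n: int) -> str:
    """Turn a non-negative counter value into a slug."""
    if n < 0:
        raise ValueError("counter value must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(slug: str) -> int:
    """Turn a slug back into the counter value it was made from."""
    n = 0
    for ch in slug:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

With this encoder, counter value 61 becomes `Z` and 62 becomes `10`. It never produces a slug with a leading `0` apart from `0` itself, so slugs like `0001` are skipped, even though `decode_base62("0001")` would happily return 1.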

I don't like it. Unnecessary complexity and high requirements for the database make it inelegant.

Random Characters Approach

We can get the benefits of both approaches by increasing the random slug size on collision. It might look somewhat like this:

  1. Start with an empty string as a slug.
  2. Add a random character to the slug.
  3. Try saving it to the database.
  4. If the slug already exists, go to step 2.
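
The four steps above can be sketched in Python like this. The 62-character alphabet, the `shorten` name, and the dict standing in for the database are assumptions for illustration. A real store would have to do the check and the write as one atomic insert-if-absent call.

```python
import secrets
import string

# Assumed alphabet: a-z, A-Z, 0-9 (62 characters).
ALPHABET = string.ascii_letters + string.digits


def shorten(store: dict, url: str) -> str:
    """Save `url` under a slug that grows by one random character per collision."""
    slug = ""  # step 1: start with an empty slug
    while True:
        slug += secrets.choice(ALPHABET)  # step 2: add a random character
        if slug not in store:  # step 3: try saving it
            store[slug] = url
            return slug
        # step 4: the slug already exists, so loop back and make it longer
```

On an empty store the first slug is always a single character. Longer slugs only appear once shorter ones start colliding.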

Characteristics of this approach:

  • Low requirements for the database: any key-value store with a collision check would suffice.
  • We start with “as short as possible” slugs and grow them organically when the database size increases.
  • The number of collisions per insert is effectively limited to a logarithm of the database size unless you’ve got a broken random number generator.
  • No manual maintenance required. Slug sizes grow over time automatically.
  • But as the database gets filled, we can manually increase the minimal slug size to reduce the number of collisions per insert.
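
Here is a rough argument for the logarithm claim, under assumptions that are not in the list above: the alphabet has A characters, each character is drawn uniformly and independently, the database holds N slugs, and C is the number of collisions during one insert. The k-th attempt is a uniformly random string of length k, so it matches a stored slug with probability at most N / A^k, and never more than 1. Summing over attempts bounds the expected number of collisions:

```latex
\mathbb{E}[C] \;\le\; \sum_{k \ge 1} \min\!\left(1, \frac{N}{A^{k}}\right) \;<\; \log_{A} N + \frac{A}{A-1}
```

For A = 62 and a million stored slugs, log_62(10^6) is about 3.35, so the bound comes to roughly 4.4 expected collisions per insert. This bounds the average, not the worst case, which is why a broken random number generator can still ruin it.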

I like it. It's simple and it "just works" in a predictable manner.

Conclusion

Slug generation seems like a small utility function, but it has a lasting influence on which database we use and whether someone needs to look after the service. Sometimes architectural decisions hide inside such small utility functions. You should pay attention to them, even if it feels like bikeshedding.

