Blog
Words about design, the industry, and everything in between.

URL Shortener Length

Posted on

I made a small URL shortener for Oswald at osw.li in an hour using PHP and MySQL, but I want to learn the MEAN stack, so I thought that this could be a fun starter project. One interesting decision was to decide how many characters the shortened URL’s slug be.

There can be 64 possible characters: A to Z, a to z, 0 to 9, – and _. Even if we make a 3-digit slug, there can be 64^3 = 262,144 possible URLs, which is a big number. The trouble happens with collisions, though. After how many URLs would a pseudorandom generator have repetition? I wrote some JavaScript to find out.

It essentially creates slugs until they’re repeated and returns the number at while repetition happened. It does this 10,000 times and logs the average.

function randomString(length) {
	var result = "", chars = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-";
	for (var i = length; i > 0; --i) result += chars[Math.floor(Math.random() * chars.length)];
	return result;
}
for (k = 0; k < 5; k++) {
	var bosarr = [];
	for (j = 0; j < 10000; j++) {
		var arr = [];
		var duplicate = 0;
		do {
			var num = randomString(k);
			var check = 0;
			for (i = 0; i < arr.length; i++) {
				if (arr[i] == num) {
					check++;
				}
			}
			if (check == 0) {
				arr.push(num);
			} else {
				duplicate++;
			}
		} while (duplicate <= 0);
		bosarr.push(arr.length);
	}
	var sum = 0;
	for (i = 0; i < bosarr.length; i++) {
		sum += parseInt(bosarr[i], 10);
	}
	var avg = sum / bosarr.length;
	console.log("For a " + k + "-digit string, there will be repetition after " + avg + " strings");
}

For a 3-character slug, repetition happened at around 640. This means that after around 600 shortened URLs, we would have to re-generate a slug. For a 4-character one, it was around 5,000. And for 5, it was around 40,000.

Of course, we’ll also check if the slug exists, but to (mostly) avoid it, a 5-character slug makes most sense. 40,000 URLs before we have to ever re-generate. Interesting.

Footnote: If the odds are that repetition happens after 40,000 URLs, do we really have to send in a database query every time to check? And if we’re doing that, why not stick to 4-character ones? They’re shorter, and there can be over 16 million possible URLs. I pick 5-character because the probability gets reduced by over 6 times by adding one character, but 4 isn’t too bad if we’re checking anyway.