Do’s and Don’ts for scalable image hosting

If you’ve been reading my web technology blog, know what I’m up to, or are generally stalking me in some other way, then you’ll know I’ve been getting my hands dirty hosting artwork and photography images for a number of years now. From this experience I’ve distilled the following list of do’s and don’ts.

While this list is mainly geared towards high-load situations, a lot of these points can easily be applied to sites of any scale, and you never know when your quiet little blog is going to be hit by Digg…


Don’t serve images via your application (i.e. PHP etc.)

The key word here is serve.

I think it’s totally cool to use an on-the-fly image generation script. Want to tweak all the image sizes on your mega site by 1 pixel? No problem: step 1) change a line in a config file; step 2) there is no step 2! In fact that’s the way I generally do things.

However, there’s absolutely no reason why you should call this script more than once per image. Yes, I know you can handle caching in the script, but why bother? Write the images to disk and let your webserver do the dirty work of caching etc. The way to do this is to ‘lazy load’ the script, i.e. call it once, the first time an image is requested; after that the image file is served as normal. Lazy loading means your server doesn’t get clogged up for hours regenerating all your images; they are only regenerated when requested. Anyway, this is simple enough to do with a 404 rule or URL rewriting:

Diagram: generate images with an on-the-fly script, but serve them with Apache (see tips below)


You can use a 404 script to serve missing images; here’s how with a .htaccess file:

ErrorDocument 404 /generate_image.php

or an Apache URL rewrite rule:

RewriteEngine On

RewriteCond $1 ^image
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ /generate_image.php/$1 [L]

Then in generate_image.php use the ‘REQUEST_URI’ server variable to figure out which image is being requested (hopefully you have some sort of unique key in the file name), e.g.
/image/user/bob/23.jpg
and call your image generation function. Here’s a brief PHP example (based on the 404 redirect).

// which image is this?
$url = $_SERVER['REQUEST_URI'];

// e.g. /image/user/bob/23.jpg -> array('image', 'user', 'bob', '23.jpg')
$bits = explode('/', ltrim($url, '/'));

if ($bits[0] == 'image' && $bits[1] == 'user')
{
		$user = $bits[2];
		$image_id = $bits[3]; // e.g. 23.jpg

		generate_image_function($user, $image_id);

		// now redirect to the freshly written image file
		header('Location: ' . $url);
		exit;
}
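
In case it helps, here’s roughly what generate_image_function could look like using PHP’s GD extension. This is only a minimal sketch under assumed paths (originals in /originals/, public copies under /image/user/) and an assumed maximum width; adapt it to your own layout.

// Minimal sketch of an image generation function using GD.
// The source/target paths and the 400px maximum width are assumptions.
function generate_image_function($user, $image_id, $max_width = 400)
{
		$source = "/originals/$user/$image_id";                               // original upload
		$target = $_SERVER['DOCUMENT_ROOT'] . "/image/user/$user/$image_id";  // public copy Apache will serve

		$original = imagecreatefromjpeg($source);
		$width    = imagesx($original);
		$height   = imagesy($original);

		// scale down, preserving the aspect ratio
		$new_width  = min($width, $max_width);
		$new_height = (int) round($height * ($new_width / $width));

		$thumb = imagecreatetruecolor($new_width, $new_height);
		imagecopyresampled($thumb, $original, 0, 0, 0, 0, $new_width, $new_height, $width, $height);

		// write the file to disk so the webserver can serve it directly next time
		if (!is_dir(dirname($target))) {
				mkdir(dirname($target), 0755, true);
		}
		imagejpeg($thumb, $target, 85);

		imagedestroy($original);
		imagedestroy($thumb);
}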

Do use a CDN

Scatter your seeds.

A content delivery network helps speed up your page loads on many levels.

  • Images load faster when they’re served from a server geographically near the user
  • Multiple servers mean more images can load in parallel
  • It frees up resources on your application server (i.e. your website): you reduce the number of requests it has to handle, and the image servers don’t need to load all the bells and whistles your app needs (PHP modules etc.)

Poor Man’s CDN

A CDN sounds a bit scary and expensive, but if you simply move your images onto another server you’ll see a lot of the benefits already. As a simple poor man’s CDN you could do the following:

  • Set up a few domains for images – subdomains will do: images-a.mysite.com, images-b.mysite.com, images-c.mysite.com
  • Get hold of one or more other webservers (bog standard webhosting accounts will do)
  • Use your app or round-robin DNS http://en.wikipedia.org/wiki/Round_robin_DNS to prepend one of your image domains to each image URL (see the PHP sketch after this list), e.g.

    http://images-a.mysite.com/image/user/bob/97892789.jpg

  • Copy/rsync your /image folder to the other server(s) with a cronjob
  • On the image servers, have a 404/URL rewrite rule that sends the user back to the original server for images that haven’t been copied across yet.
    .htaccess example:

    RewriteEngine On

    RewriteCond $1 ^image
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule ^(.*)$ http://www.mysite.com/$1 [L]

This will work even if all the domains are on the same server (just a different one to your app) and alias to the same directory, plus it sets you up ready for expansion later when you really do need to break out onto multiple servers.
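
To illustrate the ‘prepend a domain’ step, here’s a minimal PHP sketch. The image_cdn_url helper name and the domain list are made up for illustration; hashing the path rather than picking a host at random means the same image always lives on the same hostname, which is kinder to browser and proxy caches.

// Minimal sketch: map an image path onto one of the image domains.
// Helper name and domain list are hypothetical.
function image_cdn_url($image_path)
{
		$domains = array(
				'http://images-a.mysite.com',
				'http://images-b.mysite.com',
				'http://images-c.mysite.com',
		);

		// hash the path so the same image always maps to the same domain
		$index = abs(crc32($image_path)) % count($domains);

		return $domains[$index] . $image_path;
}

// e.g. http://images-b.mysite.com/image/user/bob/97892789.jpg
echo image_cdn_url('/image/user/bob/97892789.jpg');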

Diagram: poor man’s content delivery network (for images)

Poor man’s CDN plan B – Amazon S3

Well, this could perhaps be plan A: as you’ll see in the point below, S3 could act quite nicely as a CDN, being a geographically dispersed set of redundant servers! My main concern would be the potential for the bandwidth costs to rise. They’ll be cheap enough to begin with, but the cost scales closely with the bandwidth used: compare 5TB of monthly transfer on S3 (around $900, i.e. roughly $0.18/GB) with Dreamhost, where it’s included in the basic plan.


Do host data sources (precious original images) on a third party infinite disk

Multiple redundancy, infinite scalability, gourmet almond croissants

If you’ve got some funky image auto-regeneration going on, or you simply want to store higher-resolution ‘originals’, then you’ll need a large warehouse and a caretaker to look after them. Ideally multiple redundant, replicated warehouses and caretakers, in case one of them gets attacked by a suicide bomber etc.

When I started The Artists Web I realised storage would be the first bottleneck I’d hit, so I quietly hoped the problem would just go away. In fact it did just go away: one day I discovered Amazon S3, and moreover I read on Don MacAskill’s blog that his company had switched over in about a day. So did we.

Amazon S3 is essentially an infinite network disk. I won’t go into any technical details here, but it’s cheap, easy to set up, and stores your data redundantly across multiple locations. There are plenty of libraries, command line tools and a few commercial services to boot.
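
To give a flavour of how little code the upload side needs, here’s a minimal sketch using the official AWS SDK for PHP (which post-dates this post, so treat it as just one option among the many libraries mentioned above); the bucket name, region and file paths are placeholders.

// Minimal sketch: push an original image up to S3 with the AWS SDK for PHP.
// Bucket name, region and paths are placeholders.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
		'version' => 'latest',
		'region'  => 'us-east-1',
]);

$s3->putObject([
		'Bucket'     => 'my-original-images',
		'Key'        => 'originals/bob/23.jpg',   // key mirrors the local path
		'SourceFile' => '/originals/bob/23.jpg',  // local file to upload
]);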


Don’t store images in a database

This is a Japanese maintenance-and-backup horror film.

When I was planning things a few years back, I remember reading forum threads debating whether it was a good idea to store images in a database. The reason I liked the idea is centralisation of data: no need to manage image backups separately. The arguments against seemed to be mostly about having to do your own caching, use of resources etc. However, you can output cache headers from your scripts, write the (thumbnail) files to the file system and so on to get around all these performance issues.

Nice as the central backup idea is, the problem (as I learned the not-so-easy way) comes when you have more than a few hundred images: you end up with a database the size of the Ghostbusters Marshmallow Man crossed with a cow to back up. Even if you’re running backups off a replicated slave, your database is (most likely) many orders of magnitude larger than it needs to be, which leads to all kinds of headaches. Disk space, restoring from backups, or simply running a routine OPTIMIZE TABLE on a 100GB table is not my idea of a nice cron job (pun intended). It feels somewhat like trying to swim with a pointless cow handcuffed to your ankle when the rest of your data is less than 48k (well, okay, 480MB). There may be reasons why ‘enterprise’ databases (Oracle?) don’t mind this kind of cow sitting inside them, but I only have experience of ‘toy’ databases like MySQL.

Don’t store your images in your database, kids.


Do use a class/function to get image URLs

Save headaches later on.

Your image URL schema is probably simple, something like /images/username/size_imageid_imagename.jpg, so is it really worth having a function/class to generate URLs? Well, it’s certainly not going to hurt, and it will save time later on. Want to shift to a CDN? Substitute smaller images for mobile devices? Serve the image from a different server depending on the user’s geographic location? No problem: most of these can be done with a few lines of code added to the image URL class.

//normal images
$image_url->get_image_url($image_id);
$image_url->get_user_avatar($user_id);

//mobile devices? piece of cake
$image_url->set_mobile_device();
$image_url->get_image_url($image_id); //returns url to smaller image size
$image_url->get_user_avatar($user_id); //ditto

Do use a config file for image sizes

If you’re following the above rule, then you’ll probably be doing this anyway:

class image_url
{
		var $tiny_thumbnail_size = 30;
		var $thumbnail_size = 200;
		var $image_size = 400;

		/* etc. */

}

Simply make sure you only ever need to specify your image sizes in one place.
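
Tying the last two points together, here’s a minimal sketch of what such a class might grow into once the poor man’s CDN domains and a mobile size are bolted on. The method names follow the usage example above; the size values, the URL path layout and the domain list are purely illustrative.

class image_url
{
		// all image sizes live in one place
		var $thumbnail_size = 200;
		var $image_size     = 400;
		var $mobile_size    = 120;  // illustrative

		var $mobile = false;

		// image domains for the poor man's CDN (illustrative)
		var $domains = array(
				'http://images-a.mysite.com',
				'http://images-b.mysite.com',
				'http://images-c.mysite.com',
		);

		function set_mobile_device()
		{
				$this->mobile = true;
		}

		function get_image_url($image_id)
		{
				$size = $this->mobile ? $this->mobile_size : $this->image_size;
				$path = "/image/$size/$image_id.jpg";

				// same image always maps to the same domain (cache friendly)
				$domain = $this->domains[abs(crc32($path)) % count($this->domains)];

				return $domain . $path;
		}
}

get_user_avatar would follow the same pattern, just building a different path.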

That’s it (for now)

Well, I’m sure I’ll be back to add some more points; in the meantime I hope you find this a useful read and look forward to your feedback!

8 comments for “Do’s and Don’ts for scalable image hosting”

  1. April 20, 2008 at 7:23 am

    Implemented the first trick on OSF – I was serving images via a script which handled caching, but of course as you said there’s no point having the script handle this if the server can already tell you whether a file exists or not. So now, if the file exists I serve the file normally; if not, I serve the image via the script (which then creates the file for next time) – brilliant! Things are noticeably more snappy now.

    Cheers
    Jon

  2. April 23, 2008 at 2:43 am

    Bingo! Glad that’s worked for you Jon :)

  3. Managed Hosting
    August 30, 2008 at 6:37 pm

    Hi mate, great post! I would like to subscribe to your blog, please tell me where I can subscribe. Thanks!

  4. May 28, 2010 at 3:30 am

    In order to determine if the image file exists at first, how exactly did you manage to handle the 404 response from Amazon S3?

  5. May 28, 2010 at 3:44 am

    In this example I’m only using S3 as the original data store, not as the CDN, so the 404 comes from the CDN server.

    Unfortunately you couldn’t use S3 for the CDN with this “poor man’s CDN” method. You’d have to pre-generate the image thumbnails and put them on S3 first as there’s no way to add .htaccess rewrite rules to S3.

  6. May 28, 2010 at 3:58 am

    Thanks for the quick response. That’s pretty much what I thought. Just spent an hour looking into it and your post here was the only glimmer of hope that it was possible. I think I’ll forgo the dynamic image sizing for now.

  7. July 3, 2012 at 11:10 pm

    Thanks for the don’ts, because they are very important!
