Hi!
> Julian and Joseph Mack and I ran through this discussion in gruesome
> detail offlist, with me trying to prove that any kind of memory based
> persistence selection would fail in a highly loaded web cache
> environment in a very short time.
Memory based - that means you want to keep a list of every IP address
ever used, to ensure locality? Hey, that's impossible.
You have to use a hashtable.
It might work with only a limited client population, but there are
many research papers showing that caching only pays off with a big
client population.
I had a trace with more than 33,000 DIFFERENT requests per day. You'll
have much, much more on a big system like here in France, with a
backbone of several GBps.
You'll have to make a tradeoff between time needed and RAM needed.
You can keep all those addresses in an unsorted array with 5 bytes
per entry, or in a sorted list with two additional pointers per
entry, or you might build a tree, which needs much more memory.
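Just to make the tradeoff concrete, here is a minimal sketch (not our
actual code; the bucket count and struct layout are assumptions of mine)
of a hashtable keyed on the client IP that finds the assigned cache in
O(1), without any sorted structure:

    #include <stdint.h>
    #include <stdlib.h>

    #define HTAB_BITS  12                     /* 4096 buckets; assumption */
    #define HTAB_SIZE  (1 << HTAB_BITS)

    struct ip_entry {
        uint32_t        client_ip;            /* client address */
        unsigned char   cache_idx;            /* cache this client sticks to */
        struct ip_entry *next;                /* chaining for collisions */
    };

    static struct ip_entry *htab[HTAB_SIZE];

    /* simple multiplicative hash on the IP; any smooth hash would do */
    static unsigned int ip_hash(uint32_t ip)
    {
        return (ip * 2654435761u) >> (32 - HTAB_BITS);
    }

    /* look up the cache assigned to a client, or -1 if it is unknown */
    int lookup_cache(uint32_t ip)
    {
        struct ip_entry *e;

        for (e = htab[ip_hash(ip)]; e != NULL; e = e->next)
            if (e->client_ip == ip)
                return e->cache_idx;
        return -1;
    }

    /* remember an assignment; a handful of bytes more per client than
       the unsorted array, but no scan over the whole client list */
    void remember_cache(uint32_t ip, unsigned char idx)
    {
        unsigned int h = ip_hash(ip);
        struct ip_entry *e = malloc(sizeof(*e));

        if (e == NULL)
            return;
        e->client_ip = ip;
        e->cache_idx = idx;
        e->next = htab[h];
        htab[h] = e;
    }

You pay a few extra bytes per entry for the chaining pointer, but lookup
time no longer grows with the client population.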
> I don't think I ever convinced Julian
> that this isn't the same as static routes or that a new selection
> algorithm was needed, but I'll be happy to collect all of those mails
> and send them to anyone who is interested.
Yeah. I'm interested.
> The fact is, as Thomas has tried to explain, that the best solution to
> the web cache balancing problem _is_ a hash based selection algorithm.
> Because of the unique load presented by web caches, nothing else makes
> sense.
Well, in fact Wensong's algorithm made me think at first. And it might
work better than mine, if the "aging" problem is solved and only one
director is used. But
> butter of web caches...the navigation images that never change and keep
> getting served from the web cache...they'll end up spread across all
> caches. Wasting that much bandwidth needlessly.
True.
> > > 3. Hash method is not good.
> >
> > Like 1?
>
> I think he means the actual hash. I don't know enough about hashes to
> counter. If that's a problem, it is easily solved...there are about a
> million good hashes out there in the public domain.
Actually, we use CRC16. In an earlier master's thesis, someone compared
a bunch of hashing functions and found it to be the best. Why? Oh yes! A
further small disadvantage of Wensong's idea:
Imagine that a cache breaks down. A DoS attack, or just Windows 2000,
which is the same thing in my eyes.
The clients still want their data. They press their "reload" buttons.
All the requests are sent to the least loaded cache until it's
overloaded; then the requests are again sent to the new least loaded
cache.
When a cache goes down, its traffic will NOT be equally distributed
among all the others. And when it comes up again, it will NOT receive
the requests for the data that are still on its disks.
Many hashing algorithms can't achieve this either. If you use something
with "mod", most of the requests will be assigned to different caches
when the number of caches changes.
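To see how bad plain "mod" is, here is a toy example (the "hash" values
are just arbitrary numbers standing in for hashed client IPs): going from
4 caches to 3 remaps most of the clients, not only the quarter that sat
on the dead cache:

    #include <stdio.h>

    int main(void)
    {
        /* pretend these are hash values of client IPs */
        unsigned int hashes[] = { 10, 11, 12, 13, 14, 15, 16, 17 };
        int moved = 0, i;

        for (i = 0; i < 8; i++) {
            int before = hashes[i] % 4;   /* 4 caches alive */
            int after  = hashes[i] % 3;   /* one cache died */
            if (before != after)
                moved++;
            printf("hash %u: cache %d -> cache %d\n",
                   hashes[i], before, after);
        }
        printf("%d of 8 assignments changed\n", moved);
        return 0;
    }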
Another point is whether the hashing algorithm hashes smoothly. If you
feed the algorithm the IP addresses as strings, it may happen that the
results are not equally distributed, because every 4th character is a
"." (dot).
For CRC16 it doesn't matter. Moreover, CRC16 is much faster than
MD5, which would have similar properties.
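For what it's worth, a rough sketch of the kind of CRC16 bucket selection
I mean (the polynomial variant, the bucket count and the function names
are assumptions of mine, not necessarily what the thesis measured):

    #include <stdint.h>
    #include <string.h>

    #define NBUCKETS 16384   /* fixed-size bucket table; assumption */

    /* bitwise CRC-16 (one standard reflected 0xA001 variant) */
    static uint16_t crc16(const unsigned char *data, size_t len)
    {
        uint16_t crc = 0xFFFF;

        while (len--) {
            int bit;
            crc ^= *data++;
            for (bit = 0; bit < 8; bit++)
                crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;
        }
        return crc;
    }

    /* hash a dotted-quad client address into a bucket; the "."
       characters don't skew the result because CRC mixes every input
       byte into all output bits.  How buckets are then assigned to
       caches is a separate question (see the "mod" issue above). */
    unsigned int client_bucket(const char *client_ip_str)
    {
        return crc16((const unsigned char *)client_ip_str,
                     strlen(client_ip_str)) % NBUCKETS;
    }

The point is only that CRC spreads the fixed dots over all output bits,
so the bucket distribution stays smooth.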
Thomas