Saturday, October 24, 2009

SixDegreeSteam: Programmatic

First of all, thanks to everyone who has subscribed to my Devlog, sent me emails, or otherwise expressed great interest in SixDegreeSteam. You all serve as my motivation, and I am very proud to be so highly regarded. Thank you all! Now, as promised, here is another SDS tech article.

Since the conception of the project, SixDegreeSteam has changed fundamentally a couple of times. The reason is that there are so many factors to consider that it is nearly impossible for me alone to account for them all. I've sat down and worked out the best approaches I can conceive, but after putting them into working code, something new shows itself and justifies a rethinking. The major issue is that there are two conflicting, equally and critically important aspects to this project: bandwidth and memory usage.

Bandwidth and memory are quasi-mutually exclusive. In other words, you cannot optimize one in a program that uses both resources without negatively impacting the other. The reason is simple. In SixDegreeSteam, with the exception of static and very little dynamic data, anything we transfer (bandwidth usage) is useless to the crawler and is therefore unnecessary to keep in memory (memory usage). Likewise, anything kept on the client side for a long time, indefinitely, or otherwise in a non-time-critical manner is not likely to be needed by the server, so memory on the client is spent in favor of bandwidth. While this trade-off cuts both ways, bandwidth is a much more precious commodity in the long run. Memory can be reclaimed by throwing away data, but bandwidth comes in limited amounts (for most of us) on a monthly basis. However, the more memory available to the program, the faster and larger the resulting dataset can be. I've had to rewrite the messaging protocol between the server and client once, the crawler module thrice, and the program base twice as a result of trying to find the best trade-off between the two resources. I have a sick feeling there will be a few more rewrites to come, and many more after the project goes public.

The bandwidth-versus-memory conundrum has led to a change in the structure of the program as a whole. In previous articles and discussions, I spoke of a multi-threaded program layout consisting of a server and a client (slave threader), with the client managing "slave threads" responsible for the actual crawling operations. As of this writing, multi-threading is still a part of the server aspect in order to allow the server to communicate with multiple clients at once. But, the client has been changed from the slave threader approach to a single-thread, single-crawler, tree-based data collection scheme. The client starts by requesting a Steam profile to begin crawling from the server. Then, an object at the top of a tree data structure is created to represent the profile. The object then crawls the ID, friends, and groups of the profile, creating leaf objects for each in the data structure as they are found. After the data structure reaches a set capacity, it's compressed and sent to the server. The client then removes the obsolete data and continues crawling profiles not yet completed. Two questions arise from this approach.
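To make the tree-based scheme a little more concrete, here is a minimal sketch in Python. All of the names (ProfileNode, flush, CAPACITY) and the JSON-plus-zlib wire format are my illustration only, not the actual SDS code or protocol:

```python
import json
import zlib

CAPACITY = 1000  # flush threshold (assumed; tuned against available memory)

class ProfileNode:
    """One Steam profile in the crawl tree."""
    def __init__(self, profile_id):
        self.profile_id = profile_id
        self.friends = []   # child ProfileNodes discovered while crawling
        self.groups = []    # group IDs the profile belongs to

def flush(root, send):
    """Serialize the tree, compress it, and hand it to the server link."""
    def to_dict(node):
        return {"id": node.profile_id,
                "groups": node.groups,
                "friends": [to_dict(f) for f in node.friends]}
    payload = zlib.compress(json.dumps(to_dict(root)).encode())
    send(payload)  # in SDS this would go over the client-server connection
```

The key property is the one described above: once `flush` runs, the subtree's memory can be reclaimed on the client, and only profiles still queued for crawling survive.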

First, how do we make sure we aren't wasting memory and a small amount of bandwidth by crawling profiles the server already has on record? The simple answer: by spending an even smaller amount of bandwidth between the client and server to ask whether the server needs the profile. When a profile is queued for crawling in the dataset, the client will send a message to the server containing the profile ID. The server will then respond with a message stating whether to continue crawling or remove the object from the queue. Herein lies the problem outlined before by our memory-bandwidth trade-off. On one hand, we could save bandwidth between the client and server by simply crawling every object put on the queue, at the expense of the memory those objects would consume. On the other, we could save that memory at the expense of the bandwidth needed to ask the server whether each queued profile has already been crawled. The problem is made even more complex by the fact that any profiles we remove from the dataset, whether after sending them to the server or after discovering they have already been crawled, will not be known to the client in the future. The possibility of a profile resurfacing is almost certain (finding that out is the whole point of the project!). So, question two arises.
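The ask-before-crawling exchange above amounts to one tiny request-reply round trip per queued profile. A sketch of what that might look like, with message names (NEED_PROFILE, CRAWL, SKIP) invented for illustration rather than taken from the real SDS protocol:

```python
def should_crawl(profile_id, server):
    """Ask the server whether a queued profile still needs crawling.

    `server` is any object with send()/recv(); the dict-based messages
    here stand in for whatever the actual wire format is.
    """
    server.send({"type": "NEED_PROFILE", "id": profile_id})
    reply = server.recv()  # expected: {"type": "CRAWL"} or {"type": "SKIP"}
    return reply["type"] == "CRAWL"
```

The trade-off discussed above lives entirely in how often this function gets called: every call costs a little bandwidth, and every call avoided risks wasting memory on an already-crawled profile.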

Second, how do we stop the client from crawling profiles more than once? As the answer to question one pointed out, we have the opportunity to simply ask the server whether we have crawled a profile before. But this will inevitably turn out to be a gigantic waste of bandwidth, because we will end up asking the server the same question more than once just to receive the same answer. The alternative is to implement a low-impact cache, capable of quickly and compactly recording which profiles the server has already reported as crawled. And, you guessed it, the trade-off presents itself again: the cache increases memory usage to relieve bandwidth usage.
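As a rough illustration of where such a cache sits in the loop, here is a minimal version built on a plain set. This is my own sketch, not the SDS implementation; a genuinely low-impact cache would likely bound its memory (for example by evicting old entries), which this toy version does not:

```python
class CrawledCache:
    """Remembers which profile IDs the server has marked as crawled."""
    def __init__(self):
        self._seen = set()

    def mark(self, profile_id):
        self._seen.add(profile_id)

    def contains(self, profile_id):
        return profile_id in self._seen

def should_ask_server(profile_id, cache):
    # Skip the server round trip entirely when the cache already knows
    # the profile was crawled; otherwise spend the bandwidth to ask.
    return not cache.contains(profile_id)
```

Each cache hit trades a few bytes of client memory for one avoided request-reply exchange, which is exactly the bargain described above.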

This is a small glimpse into the current layout of the program. It most likely does not do justice to the project's complexity, but it should give you an idea of how the program will work and why it's taking me so long to complete.
