Thursday, September 10, 2009

SixDegreeSteam: Challenges

There are a few caveats to consider when developing this project. I have laid the information out in a (hopefully) easy-to-follow situation-caveat format. An important thing to remember is that not all caveats have an inherent solution, and the best solution is not always the most favorable.

Situation: First and foremost is the way Steam Community presents the information we hope to use. Since all information will be acquired by following "tips" from one page to a handful of others, and continuing this pattern until all pages are analyzed, we call the process crawling. It works much like a search engine's crawler: where a search engine follows the links on one page to reach others, our crawler will read IDs from one page and construct the URLs of the next pages to fetch. Crawling friend information is easy, since the entire friends list for a profile is in a single HTML file. All we have to do is download the file from the site, extract all friend IDs, and queue them for crawling. Similarly, there is an XML file for each user containing profile information, including the groups the user is a member of. As for crawling groups, each group has a set of XML files listing all member IDs (one XML file per 1,000 members). Note that the XML file we use to get a user's group memberships is not ideal, since it contains far more information than we need, so download speed and bandwidth usage (on both the crawling server and the Steam Community network) suffer unnecessarily. One solution would be the Steamworks API that Valve makes available to developers, which can access community information directly without crawling community pages, reducing bandwidth usage and overhead. However, Valve has declined to grant me usage rights to Steamworks.
Caveat: Steam users have the ability to set privacy options that affect who is allowed to view their profiles. The current privacy options are private, public, and friends-only. Our crawler will only be able to use public profiles. Note, though, that groups are always public, so their member lists are always visible; this means we can at least gather group membership information for non-public profiles. Likewise, if a private user is a friend of a public user, we can still link the two in our dataset, since friendship on Steam is mutual (user A cannot have user B on their friend list without user B also having user A on theirs).
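
To make the crawling loop concrete, here is a minimal sketch in Python. The URL formats, the privacyState tag, and the friends-page markup are my assumptions about how Steam Community lays its pages out, so treat this as an illustration of the process rather than the project's actual implementation:

```python
# Minimal crawl sketch. URL formats and tag names below are assumed
# placeholders for how Steam Community pages are laid out.
import re
import urllib.request
from collections import deque

PROFILE_XML = "http://steamcommunity.com/profiles/%s/?xml=1"    # assumed format
FRIENDS_HTML = "http://steamcommunity.com/profiles/%s/friends"  # assumed format

def fetch(url):
    """Download a page and return its contents as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(seed_id, limit=100):
    """Breadth-first crawl of friend links, starting from one Steam ID."""
    queue, seen = deque([seed_id]), {seed_id}
    while queue and len(seen) < limit:
        steam_id = queue.popleft()
        profile = fetch(PROFILE_XML % steam_id)
        # A private profile comes back with an explicit privacy state
        # (or missing data), so we skip the friends download entirely.
        if "<privacyState>public</privacyState>" not in profile:
            continue
        friends_page = fetch(FRIENDS_HTML % steam_id)
        # Pull 64-bit IDs out of the friends HTML; the exact markup is
        # an assumption, and a real crawler would use a proper parser.
        for friend_id in re.findall(r"/profiles/(\d{17})", friends_page):
            if friend_id not in seen:
                seen.add(friend_id)
                queue.append(friend_id)
    return seen
```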

Situation: As previously stated, the project is very web-intensive, using a crawler to gather our data. This is a massive undertaking, requiring a huge number of page/file downloads and subsequent parsing. While it is impossible to give an exact number for the size of the resulting dataset, the number of files we need to download can be expressed with a simple formula. We need at least one page per user (the profile information XML file, which also contains group membership information). This page tells us whether the profile is public: we either get an access error, find data missing, or see the privacy setting explicitly. If it's private, downloading the next file would be wasteful, since we don't have permission to view it and would just get a similar error. If it's public, we download a second file (the friends list HTML) to complete our data collection for that user. Groups are simpler: they are always public and always have at least one member, so each needs at least one file (the member list XML). Our formula for the number of files to download is A + B + (C * ceil(D/1,000)), where A is the number of Steam users, B is the number of users with public profiles, C is the number of groups, and D is the average number of members per group. As of this writing, there are over 53,300,212 Steam users and 1,101,806 groups. Guessing that at least three quarters of all users have public profiles and that the average group has 2,000 members (in reality, some users are in one group while others are in a hundred, and some groups have one member while others have a million; 2,000 members per group is the best on-the-spot estimate I can come up with for now), our file count is 53,300,212 + (53,300,212 * (3/4)) + (1,101,806 * ceil(2,000/1,000)) = 53,300,212 + 39,975,159 + 2,203,612 = 95,478,983.
Caveat: To put that into data consumption terms, a member list for a group with 1,000 members is about 42 KB, a friends list containing 50 friends is about 40 KB, and a complete profile page (used to get group memberships) showing 10 group memberships is about 14 KB. It's very rare for a single file to exceed 42 KB. Also, since the user profile XML is what tells us whether a profile is private, it's worth pointing out that the file returned for a private profile is only about 366 B. So, adapting our previous formula to calculate bandwidth usage, we get (53,300,212 * 0.36) + (39,975,159 * (40 + (14 - 0.36))) + (2,203,612 * 42) = 2,256,007,309 KB = 2.10 TB. That is a serious amount of data to be transferring! We won't keep all of it, but as pointed out in Caveat 1, this is how we have to collect the data for lack of an alternative.
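
For anyone who wants to check the arithmetic, here are both estimates as a small script, using the same guessed figures (75% public, 2,000 members per average group) as above:

```python
# Reproducing the file-count and bandwidth estimates from the text.
import math

users        = 53_300_212          # total Steam users
public_users = users * 3 // 4      # guessed: three quarters are public
groups       = 1_101_806
avg_members  = 2_000               # guessed average members per group

# One member-list XML page per 1,000 members, per group.
member_pages = groups * math.ceil(avg_members / 1_000)
files = users + public_users + member_pages
print(files)                       # 95,478,983 files

# Bandwidth in KB: every user costs a 0.36 KB privacy check; public users
# add a friends list (40 KB) plus the rest of the profile XML (14 - 0.36 KB);
# each member-list page is about 42 KB.
kb = users * 0.36 + public_users * (40 + (14 - 0.36)) + member_pages * 42
print(kb / 1024**3)                # ~2.10 TB
```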

Situation: Caveat 2 points out that there is a lot of data transfer involved, so bandwidth usage is definitely an issue. Naturally, processing time for the data is also a factor. We can't provide accurate estimates for how long parsing each of the three file types will take, because some computers are faster than others, processing time changes as memory becomes available or is exhausted, and so on. In any case, the obvious way to speed up both crawling and processing, and to spread bandwidth usage around, is to use more than one computer. To speed processing up even further, we can create more than one instance of the program and have each instance work on a different page concurrently. This means we can process ten or more pages in the time it would take a single instance to process one. I will discuss both of these in depth in a later article.
Caveat: Creating a control structure that can manage multiple instances (called threads) is one thing. Creating a server program that can remotely manage multiple programs, each running multiple instances, is a completely different thing. The architecture of the server and its complementing "slave threaders" will have to let each slave threader work on its own, to an extent. The only times a slave threader should call home to the server are when the global scope of the dataset needs to be analyzed or when all of its threads are idle (more work is needed). We adopt this minimalistic communication policy so that bandwidth usage between the two components does not become a factor in itself. This is an involved topic that deserves extra thought in its own right, so a future article will be devoted to it alone.
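
As a rough preview, here is a minimal sketch of that call-home-only-when-idle policy. The server's get_work method and the crawl_page helper are hypothetical stand-ins for components that don't exist yet; the point is only that threads drain a local queue and contact the server just when the queue runs dry:

```python
# Sketch of a slave threader that phones home only when starved for work.
# server.get_work() and crawl_page() are hypothetical placeholders.
import queue
import threading

def crawl_page(page):
    """Placeholder for the real download-and-parse step."""
    pass

def slave_threader(server, num_threads=10):
    work = queue.Queue()
    refill = threading.Lock()  # ensures only one starving thread phones home

    def worker():
        while True:
            try:
                page = work.get(timeout=5)
            except queue.Empty:
                # The local queue has run dry; one thread asks the server
                # for a fresh batch while the others keep waiting.
                if refill.acquire(blocking=False):
                    try:
                        for p in server.get_work(batch=100):  # hypothetical call
                            work.put(p)
                    finally:
                        refill.release()
                continue
            crawl_page(page)  # download and parse locally, no server chatter
            work.task_done()

    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()
```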

These are only a handful of the issues we need to consider when tackling this project, but they represent the major challenges I am facing right now in the design of the program. In an upcoming article, I will examine Caveat 3 more closely to show how I intend to structure the server-slave threader relationship and my implementation of thread pooling.

Edited: This post was updated for correctness on January 14, 2010.
