Sunday, January 31, 2010

SDS Status Report: January 10

A full month of crawling has passed, leaving the database pretty full and everyone involved with the project ecstatic to see everything coming together. A lot of progress has been made since the last report, so there is a lot to report on here. Strap in!

On January 3, this discussion took place:

January 3, 2010 chat between Blake "ROOT" and Harry "BlackSyte":

Blake: SDS is running a lot slower now that things are filling up, and the partitioning [of the database in an attempt to make things faster] didnt go over too well. My next attempt will be to do manual partitioning, but I dont want to do it right now because I just worked pretty hard to repair the damage the automatic partitioning did...It slowed things down to 1 process per minute.

With the current table loads and everything back to normal, its about 50 profile processes per minute. Its not THAT bad, but it isnt fast enough. Something exciting, though; SDS has reached the edge nodes. The very first Steam user has been crawled, and the very last Steam user (as of this writing) has been crawled. All that is left is everyone in between! But, it stands to prove that my theory might hold weight afterall. It shows that someone, thereby everyone, in the SDS Steam group is connected to both the minimum and maximum edges.

On January 18, SixDegreeSteam Local Server 1.5.87C (the current version) was implemented. It is thus far the most stable release with many improvements over the initial release from December. Family 1.5 brought in better error handling, a logging mechanism, parameter-based execution, a "prioritize child nodes" option for high-priority queue entries, and some programmatic fine-tuning. Release 1.5.87 introduced multi-threading for running multiple local servers (crawlers) consecutively and fixed some profile tracking issues that were previously irreproducible in production.

On January 21, the client algorithm was successfully executed. The interface is still a couple weeks out, but is coming quickly. Work on the interface was stalled after the 21st due to coursework and may continue to be stalled for 4 more weeks.

As of this writing, the database reports these statistics:

- Users crawled: 2,518,356
- Groups crawled: 748,140
- Profiles pending: 5,988,765
- Total discovered users: 7,791,431
- Total discovered groups: 748,141
- Average time between discovery and crawling: 9 days

Bonus: This is a discussion that took place on January 5. It doesnt do much to prove progress of the project, but does offer some interesting food for thought:

January 5, 2010 chat between Blake "ROOT", Dallas, and Harry "BlackSyte":

Blake: Its amazing how quickly the graph edges progressed.
Dallas: But the surface has only been barely scratched.
Blake: Exactly! I mean, we arent even half way through all the users, yet a streamlined backbone has emerged. That raises a concern...Perhaps there are more profiles with a betweenness centrality less than or equal to their degree centrality than I thought. Since a rigid, well-defined backbone has already formed so early in the project, yet there are relatively no user discoveries, it makes me think[...]

There are two possible conditions under which this would happen, guessing that the crawler itself is not at fault, and my confidence of that is fairly high as of 1.5C. First [possibility], there are a shitload of people who form "clicks" [or] small collections of friends who do not join groups or befriend "outsiders"...and by shitload, I mean over 80% of the entire demographic. Thats hard to believe, but not improbable.

The second case, which is even mmore unlikely but is very possible given a generational standpoint, is that there are sections of the entire demographic who are only friends with other members of the same section. So, you have 4 million users in segment A who are friends with other users in that same section, but none of them are friends with users from section B.Its incredibly unlikely, but its a valid portrayal of an existing graph theory called generational demography -- Newer users tend to be friends with other newer users, while older users tend to be friends with other older users, and never shall the two meet.

Heres another interesting observation Ive made. The queue timeframe is currently 9 days...Now, what that means is that there is a 9-day waiting period between discovery and crawling. Its a common occurrence in the crawler that a profile will become invalidated within those 9 days; A wildly common occurrence, in fact. People are deleting their profiles or changing their profile names way too often, somewhere between 1 to 9 days!
Harry: Does that hamper the crawling process?
Blake: In the first release, yes. The crawler would actually crash with a fatal error because it was expecting the profile to be there, but it wasnt. That would happen in the first couple days of launch before I patched it, which is pretty funny because that means the profiles were becoming invalidated within 1-3 days!

Thursday, January 14, 2010

SixDegreeSteam: Synopsis

I was curious as to a means that people like myself could come to an understanding on roughly how this program works... I understand what the goal of it is, but how do you intend to do this?...

- TopRaman

While I have explained a lot of SixDegreeSteam's abstract mechanics in previous articles, I feel they were not programmatically detailed enough to give a good picture of how exactly the program operates. So, Ill try to fill that gap here. There are three components to the project; Ill try to discuss all of them as simply and cleanly as possible.

WARNING: Previous articles have focused on previous versions of the project. Likewise, this article with focus on the current version of the project as of this writing (1.5C). I feel this current version is the most efficient, so it is likely to stick around for a long time. But, this is a disclaimer just in case the project does change again substantially.

The first and (thus far) most time-consuming component is the crawler. Without it, the project would not have a dataset. The crawler has the straight-forward job of collecting links to other profiles from one profile, saving (queueing) those links, then opening them back up sequentially to collect more links. The process continues until all links are collected and thereby all profiles are analyzed. This process of collecting, storing, reading, and repeating is called crawling. The specific crawling logic will not be described here for the sake of brevity, but all the details can be found in an earlier post titled SixDegreeSteam: Challenges. Program-wise, though, the crawler downloads the pages containing the links, extracts the links using a combination of Regular Expressions and XML parsing, and inserts them into a SQL database table called the crawler queue. The crawler also extracts some basic profile information, such as SteamID, profile name, and avatar, in a similar manner and inserts it all into another table called the user dataset (or group dataset). The complete list of datums stored per user is SteamID, last crawl time (to prevent recrawling a user too frequently), profile name, avatar, friends, and group memberships. The complete list of datums stored per group is similar: SteamID, last crawl time, group name, avatar, and members.

Given that the crawler does its job correctly, we are left with a database filled (and I mean FILLED) with information about the users and groups of the Steam Community. Oddly enough, the information is pretty inconsequential without a way to harness the datasets. So, our second component, aptly named Pathfinder, is arguably as important as the crawler. Pathfinder has the soul purpose of using the information available through the database to calculate a lowest-cost path between two given nodes. To do this, Ive opted for an object-oriented rendition of the breadth-first search graph theory algorithm. Once again, for the sake of brevity, I wont go into detail about the algorithm. If you are curious about it, follow the link or run a Google search. There are tons of articles that have covered it far better than I can. After the algorithm is applied, a list of users and groups used to reach profile B from profile A the quickest is displayed with links to each profile and avatars for recognition purposes.

The third and newest addition to the component set is a sort of network browser. Using a graphical, navigable web of the dataset, users will be able to quickly traverse the entire Steam Community social network, allowing them to get a better idea of just where they fit into it all. This component is simply an auxiliary to the project and is not a real focus. As such, it will be the last to be implemented and will only even enter the development picture after the crawler and Pathfinder are thoroughly completed.

As was said about Pathfinder, the information gathered by the crawler is pretty useless to the end user until a method of putting it to work is created. This explains why the crawler has been operational for nearly a month now, yet the site is still empty.