Sunday, January 31, 2010

SDS Status Report: January 10

A full month of crawling has passed, leaving the database pretty full and everyone involved with the project ecstatic to see everything coming together. A lot of progress has been made since the last report, so there is a lot to report on here. Strap in!

On January 3, this discussion took place:

January 3, 2010 chat between Blake "ROOT" and Harry "BlackSyte":

Blake: SDS is running a lot slower now that things are filling up, and the partitioning [of the database in an attempt to make things faster] didn't go over too well. My next attempt will be to do manual partitioning, but I don't want to do it right now because I just worked pretty hard to repair the damage the automatic partitioning did... It slowed things down to 1 process per minute.

With the current table loads and everything back to normal, it's about 50 profile processes per minute. It's not THAT bad, but it isn't fast enough. Something exciting, though: SDS has reached the edge nodes. The very first Steam user has been crawled, and the very last Steam user (as of this writing) has been crawled. All that is left is everyone in between! But it goes to show that my theory might hold weight after all. It shows that someone, and thereby everyone, in the SDS Steam group is connected to both the minimum and maximum edges.

On January 18, SixDegreeSteam Local Server 1.5.87C (the current version) was implemented. It is thus far the most stable release, with many improvements over the initial release from December. Family 1.5 brought in better error handling, a logging mechanism, parameter-based execution, a "prioritize child nodes" option for high-priority queue entries, and some programmatic fine-tuning. Release 1.5.87 introduced multi-threading for running multiple local servers (crawlers) concurrently and fixed some profile tracking issues that were previously irreproducible in production.
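To make the "prioritize child nodes" option more concrete, here is a minimal Python sketch of how such a queue could behave. It is not the SDS implementation: the `CrawlQueue` class, its method names, and the numeric priority scheme (lower number is served first) are all assumptions made for illustration.

```python
import heapq
import itertools


class CrawlQueue:
    """Hypothetical crawl queue; a lower priority number is served first."""

    def __init__(self, prioritize_children=False):
        self.prioritize_children = prioritize_children
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order
        self._seen = set()                 # never queue a profile twice

    def push(self, profile_id, priority=10):
        """Queue a profile; return False if it was already discovered."""
        if profile_id in self._seen:
            return False
        self._seen.add(profile_id)
        heapq.heappush(self._heap, (priority, next(self._counter), profile_id))
        return True

    def pop(self):
        """Return (profile_id, priority) for the next profile to crawl."""
        priority, _, profile_id = heapq.heappop(self._heap)
        return profile_id, priority

    def push_children(self, parent_priority, child_ids, default_priority=10):
        """Queue the friends/groups found while crawling a parent profile."""
        if self.prioritize_children:
            # Children of a high-priority entry inherit its priority.
            priority = min(parent_priority, default_priority)
        else:
            priority = default_priority
        for child_id in child_ids:
            self.push(child_id, priority)

    def __len__(self):
        return len(self._heap)
```

With the option switched on, the children of a high-priority entry jump ahead of everything that was already waiting at the default priority. With it switched off, they join the back of the line like any other discovery.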

On January 21, the client algorithm was successfully executed. The interface is still a couple of weeks out, but is coming along quickly. Work on the interface has been stalled since the 21st due to coursework and may remain stalled for 4 more weeks.

As of this writing, the database reports these statistics:

- Users crawled: 2,518,356
- Groups crawled: 748,140
- Profiles pending: 5,988,765
- Total discovered users: 7,791,431
- Total discovered groups: 748,141
- Average time between discovery and crawling: 9 days
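
For a rough sense of scale, the figures above can be plugged into a short Python calculation. This is a back-of-the-envelope sketch only: it assumes the single-crawler rate of about 50 profiles per minute quoted in the January 3 chat still applies, takes the pending count exactly as reported, and ignores profiles discovered after this report. The function names are made up for illustration.

```python
# Figures copied from the statistics list above.
users_crawled = 2_518_356
users_discovered = 7_791_431
profiles_pending = 5_988_765

# Assumption: roughly 50 profiles per minute per crawler (January 3 chat).
profiles_per_minute = 50


def fraction_crawled(crawled, discovered):
    """Share of discovered users that have already been crawled."""
    return crawled / discovered


def days_to_drain(pending, per_minute, crawlers=1):
    """Days needed to empty the queue if nothing new is discovered."""
    per_day = per_minute * 60 * 24 * crawlers
    return pending / per_day


print(f"Crawled so far: {fraction_crawled(users_crawled, users_discovered):.1%}")
print(f"Days to drain the queue: {days_to_drain(profiles_pending, profiles_per_minute):.0f}")
```

Under those assumptions, roughly a third of the discovered users have been crawled. Running several crawlers at once divides the drain time accordingly, which is what the `crawlers` parameter models.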

Bonus: This is a discussion that took place on January 5. It doesn't do much to prove progress of the project, but does offer some interesting food for thought:

January 5, 2010 chat between Blake "ROOT", Dallas, and Harry "BlackSyte":

Blake: It's amazing how quickly the graph edges progressed.
Dallas: But the surface has only been barely scratched.
Blake: Exactly! I mean, we aren't even halfway through all the users, yet a streamlined backbone has emerged. That raises a concern... Perhaps there are more profiles with a betweenness centrality less than or equal to their degree centrality than I thought. Since a rigid, well-defined backbone has already formed so early in the project, yet there are relatively few user discoveries, it makes me think[...]

There are two possible conditions under which this would happen, guessing that the crawler itself is not at fault, and my confidence of that is fairly high as of 1.5C. First [possibility], there are a shitload of people who form "cliques" [or] small collections of friends who do not join groups or befriend "outsiders"... and by shitload, I mean over 80% of the entire demographic. That's hard to believe, but not improbable.

The second case, which is even more unlikely but is very possible given a generational standpoint, is that there are sections of the entire demographic who are only friends with other members of the same section. So, you have 4 million users in section A who are friends with other users in that same section, but none of them are friends with users from section B. It's incredibly unlikely, but it's a valid portrayal of an existing graph theory called generational demography -- newer users tend to be friends with other newer users, while older users tend to be friends with other older users, and never shall the two meet.

Here's another interesting observation I've made. The queue timeframe is currently 9 days... Now, what that means is that there is a 9-day waiting period between discovery and crawling. It's a common occurrence in the crawler that a profile will become invalidated within those 9 days; a wildly common occurrence, in fact. People are deleting their profiles or changing their profile names way too often, somewhere between 1 and 9 days!
Harry: Does that hamper the crawling process?
Blake: In the first release, yes. The crawler would actually crash with a fatal error because it was expecting the profile to be there, but it wasn't. That would happen in the first couple days of launch before I patched it, which is pretty funny because that means the profiles were becoming invalidated within 1-3 days!
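
The crash Blake describes, and the general shape of a fix, can be sketched in Python. This is only an illustration under assumptions, not the actual patch: `ProfileGone`, `crawl_pending`, and the `fetch_profile` callable are hypothetical names. The idea is that a profile which vanished while waiting in the queue is recorded and skipped instead of stopping the whole crawler.

```python
class ProfileGone(Exception):
    """Raised by the (hypothetical) fetcher when a queued profile no longer exists."""


def crawl_pending(pending_ids, fetch_profile):
    """Crawl every queued profile, skipping the ones that have been invalidated.

    fetch_profile is any callable that returns profile data for an ID or
    raises ProfileGone if the profile was deleted or renamed after discovery.
    """
    crawled = []
    invalidated = []
    for profile_id in pending_ids:
        try:
            profile = fetch_profile(profile_id)
        except ProfileGone:
            # Record the stale entry and keep going instead of crashing.
            invalidated.append(profile_id)
            continue
        crawled.append(profile)
    return crawled, invalidated


if __name__ == "__main__":
    # Toy data standing in for real profiles; "bob" vanished while queued.
    live_profiles = {"alice": {"friends": ["bob"]}, "carol": {"friends": []}}

    def fake_fetch(profile_id):
        try:
            return live_profiles[profile_id]
        except KeyError:
            raise ProfileGone(profile_id)

    done, gone = crawl_pending(["alice", "bob", "carol"], fake_fetch)
    print(f"crawled {len(done)} profiles, skipped {len(gone)} invalidated")
```

Keeping the invalidated IDs in their own list also makes it possible to measure how often profiles disappear during the queue window.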
