Blake's Coding Blog: September 2009

Saturday, September 12, 2009

SixDegreeSteam Official Group Open

The official SixDegreeSteam group (SixDegree$team) is now open for all interested Steam users to join. While project news and developmental goodies will be pushed here first, the SDS Steam group will be updated with important project updates only. Joining is a good way to get involved and show your support if you use Steam. Being a member of it will also grant you bragging rights when/if the project becomes mainstream, because members of the group will be the first to be crawled when we launch!

Thursday, September 10, 2009

SixDegreeSteam: Challenges

There are a few caveats to consider when developing this project. I have laid the information out in a (hopefully) easy-to-follow situation-caveat format. An important thing to remember is that not all caveats have an inherent solution, and the best solution is not always the most favorable.

Situation: First and foremost is the way Steam Community presents the information we are hoping to use. Since all information will be acquired by following "tips" from one page to a handful of others, and continuing on this pattern until all pages are analyzed, we call the process crawling. It works similar to a crawler for a search engine; Where a search engine crawler follows links on one page to others, our crawler will read IDs from one page and create new links pointing to sequential pages. Crawling friend information is easy, as the entire list of friends for a profile is in an HTML file. All we have to do is download the file from the site, extract all friend IDs, and queue them for crawling. Similarly, there is an XML file for each user containing profile information, including the groups the user is a member of. As far as crawling groups goes, a set of XML files for each group exists that lists all member IDs (one XML file per 1000 members). Note that the XML file we use to get group memberships of a user is not ideal, since it contains a lot more information than we need. This means that the speed of download and bandwidth usage (on both the crawling server and the Steam Community network) are impacted unnecessarily. A solution to this may be to use the Steamworks API available from Valve to developers, which features the ability to directly access community information without the need to crawl community pages, thus reducing bandwidth usage and overhead. However, Valve has declined granting me usage rights to Steamworks.
Caveat: Steam users have the ability to set privacy options that affect who is allowed to view their profiles. The current privacy options are private, public, and friends-only. Our crawler will only be able to use public profiles. Something to note, though, is that groups are always public, therefore all members are visible. This means that we can at least gather group membership information for non-public profiles. Likewise, if the private user is a friend of a public user, we will still be able to link the two users in our dataset since friendship on Steam is mutually inclusive (user A cannot have user B on their friend list without user B also having user A on theirs).

Situation: As was previously stated, the project is very web-intensive, utilizing a crawler technique to gather our data. This is a massive undertaking, requiring a huge amount of page/file downloads and subsequent parsing. While it is impossible to give an exact number to represent the size of our resulting dataset, the number of files we will need to download can be represented by a simple formula. We need at least one page per user (the profile information XML file, which also contains group membership information). This page is used to tell if the profile is public by checking to see if we get an access error, data is missing, or the privacy setting is explicitly shown. If its private, it would be wasteful to download the next file, as we dont have permission to view it and will just get a similar error to the first file. If its publicly viewable, we download a second file (friend list HTML) to complete our data collection for that user. Groups are a lot simpler as they are always public and will always have at least one member, therefore at least one file (the member list XML). Our formula to represent how many files we will need to download is A + B + (C * ceil(D/1,000)), where A is the number of Steam users, B is the number of users with public profiles, C is the number of groups, and D is the average number of members per group. As of this writing, there are over 53,300,212 Steam users and 1,101,806 groups. Guessing that at least three quarters of all users have public profiles and every group has 2,000 members (in reality, some users are only in one group where others are in one hundred, and some groups have one member where others have one million. 2,000 members per group is the best on-the-spot average estimation I can come up with for now), our file count is 53,300,212 + (53,300,212*(3/4)) + (1,101,806 * ceil(2,000/1,000)) = 53,300,212 + 39,975,159 + 2,203,612 = 95,478,983.
Caveat: To put that into data consumption terms, a member list for a group that has 1,000 members is about 42 KB, a friends list containing 50 friends is about 40 KB, and a complete profile page (used to get group memberships) displaying 10 group memberships is about 14 KB. Its very rare that a single file will be greater than 42 KB. Also, the user profile XML is what we use to tell if a profile is private or not, so its important to point out that the file returned for a private profile is about 366 B. So, modifying our previous formula to calculate bandwidth usage, we get (53,300,212 * 0.36) + (39,975,159 * (40 + (14-0.36))) + (2,203,612 * 42) = 2,256,007,309 KB = 2.10 TB. That is a serious amount of data to be transferring! We wont keep all of it, but as pointed out in Caveat 1, this is how we have to collect the data for lack of an alternative.

Situation: Caveat 2 points out that there is a lot of data transfer involved. So, bandwidth usage is definately an issue. Naturally, processing time for the data is also a factor. We cant provide accurate estimates for how long the data parsing for each of the three files would take because some computers are faster than others, processing time changes as memory becomes/is no longer available, etcetera. In either case, the obvious solution to speeding up both crawling and processing, as well as spreading bandwidth usage, is to use more than one computer. To speed processing up even further, we can create more than one instance of the program and have each instance work on a different page concurrently. This means we can process ten or more pages in the time it would take a single instance to process one. I will discuss both of these in-depth in a later article.
Caveat: Creating a control structure that can manage multiple instances (called threads) is one thing. Creating a server program that can remotely manage multiple programs running multiple instances is a completely different thing. The architecture of the server and its complimenting "slave threaders" will have to be such that each slave threader is capable of working, to an extent, on its own. The only time the slave threader should call home to the server is when the global scope of the dataset needs to be analyzed or all threads are idle (more work is needed). We adopt this minimalistic communication policy to make sure the bandwidth usage of the data between the two components does not become a factor in itself. This is a very intrinsic topic that requires extra thought in its own. As such, a future article will be devoted to this topic alone.

These are only a handful of issues we need to consider when tackling this project. They represent the major challenges I am facing right now in the design of the program. In an upcoming article, Caveat 3 will be examined more closely to show how I intend on structuring the server-slave threader relationship and my implementation of thread pooling.

Edited: This post was updated for correctness on January 14, 2010.

Wednesday, September 09, 2009

SixDegreeSteam: The Intro

As my previous article hinted to, I am planning to write a program that expresses the link between every Steam user. As of this article's writing, there are exactly 53,300,212 accounts registered with Steam. This includes active, inactive, banned, duplicate, publisher/staff, and Internet Cafe ("multiseat") accounts. It is impossible to distinguish just how many of those accounts are actually worth the effort of this project without walking through each account profile (ironically, that's just what this project sets out to do). Each of those accounts is allowed to have a maximum of 255 friends, and an unknown (perhaps unlimited?) amount of group memberships. Now that we have some numbers down, let's talk about the actual program.

Introducing SixDegreeSteam, an attempt at mapping the Steam social network. The name is a reference to the Six Degrees of Separation sociological theory that the project is based on, which states that every person in the world is linked to every other person by no more than six people. My theory is that a similar phenomenon exists between all social mediums, specifically the Steam platform's underlying network Steam Community. However, there is a fundamental difference between Steam and other social networking mediums; Some people only use Steam to game, not make friends! This means that some accounts will have 100 friends and be members of 10 different groups, while others will have 0 friends and 0 groups. The network is even more obfuscated because some people may have friends and group memberships, but only with people they know well. This makes the web of connections incredibly shallow and in severe cases exclusive, with no connections to the mainstream web or other exclusive webs. In other words, "Everybody knows everybody" only works if "everybody" is not limited to a select few.

There are also a couple technical caveats we must address. They will be discussed in an upcoming article.

Monday, September 07, 2009

Six Degrees of Inactivity

Where did you go??? It's been months since your last post! Are you working on anything interesting?

- Anonymous

First off, sorry for the inactivity, guys. Ive been insanely busy lately with a lot of stuff, mostly rectifying a couple situations in my personal life. Now, between college work and my new job, I dont have much time to myself. What little time I do have, I use to play games with friends or work on little side projects. Because of this, I wouldnt expect to hear too much from me for a while. Feel free to email me to say hi or catch up if youd like! I check my email daily when I get home from work.

Having said that, I do have a major project Im in the middle of conceptualizing. Circumstances providing, Ill make an in-depth post about it later. For now, the singular concept with have to suffice. Have you ever heard of the Small World Phenomenon, or more to the point, the Six Degrees of Separation model? They are very interesting ideas that attempt to link every person in the world. The idea was built on to create some popular games like Six Degrees of Kevin Bacon, Six Degrees of Wikipedia, and even a Facebook application. Surprisingly, there is no implication of this idea for the wildly popular Steam platform. Well, I think its about time someone puts an end to that, and who is more qualified to do so than me? :)

The goal of the project is to create a mapping showing how every Steam user is linked to everyone else. The project will take into account friends, as well as group memberships. This is a very exciting project for me and a major undertaking. More information to come when/if I get the time!

This article is a reply to an email. If you would like to ask a question or suggest a new article, email me. blake.oxx@gmail.com (remove the dot)

Blake's Coding Blog

Saturday, September 12, 2009

SixDegreeSteam Official Group Open

Thursday, September 10, 2009

SixDegreeSteam: Challenges

Wednesday, September 09, 2009

SixDegreeSteam: The Intro

Monday, September 07, 2009

Six Degrees of Inactivity

About Me

Links

Blog Archive

Google Ads