Sunday, December 27, 2009

SDS Finished (For Real This Time!)

The centerpiece of the project, the SixDegreeSteam crawler/server, has been thoroughly completed. Since the promised launch date of December 25, the crawler had been experiencing memory issues that would cause it to die roughly 5 minutes after startup. I'm very pleased to announce that version 1.5.77C of the crawler (yes, it seriously took that many revisions to get everything up to working condition) is now in effect, and running smoothly!

Now I can switch over to working on the client software. More about that will be released later. Here are some statistics about the limited runtime the crawler got between its crashes.

- Uptime: 2 hours
- Users crawled: 1,961
- Groups crawled: 10,337
- Users/Groups pending: 408,376

A special thanks to everyone who has sent me emails, joined the Steam group, and otherwise completely overwhelmed me with their gracious comments! It's truly inspiring to see so many people in support of what I envisioned to be a small personal project. As always, feel free to email me with questions, comments, concerns, or funny lolcats. (remove the dot)

Wednesday, December 16, 2009

SixDegreeSteam Finished

After about 3 months of development, SixDegreeSteam has reached the final steps in the launch process. The database has been set up, the crawler has been completed, and the folks at Valve have been sent "please let me borrow your bandwidth" cookies. Everything is pumped and primed. However, the project has not yet been released on the community. Why? Because Valve has not yet agreed. As soon as the green flag is given, the experiment will begin full-force.

As a side note, a previous article I posted about how SDS would acquire information was wrong. Actually working on the program and digging through the Steam Community site has enlightened me a little more about the crawling process. The article will be corrected as soon as I have time.

Sunday, November 29, 2009

SixDegreeSteam Mailbag 2

@SixDegreeSteam: Bump bump bump! Updates, please!

- Anonymous

Good timing! It just so happens there is yet another development in the SDS programming progress. I decided to set aside all of the existing code for further development and implementation at a later time, and start over yet again. In this new iteration, I intend to keep things simple with a focus on mechanics and practicality, not resources and scalability. This aids the project in two ways. First, interest in hosting slave servers has been very low. So, it seems like a waste to work so hard on a component that may not even be used (or necessary, for that matter). Second and most obviously, deployment has stalled, and stalled, and stalled some more. I am very dissatisfied with the current timeline for completion compared to the originally proposed timeline. Going back to basics will speed up development phenomenally and consequently lead to a much closer launch date.

Where will the project be accessible from?

- Anonymous

The brand-new subdomain on my brand-new server/site will be the point of deployment. While we are on the subject, I will also be moving my blogs to the new site once the general site is set up.

This article is a reply to an email. If you would like to ask a question or suggest a new article, email me. (remove the dot)

Saturday, October 24, 2009

SixDegreeSteam: Programmatic

First of all, thanks to everyone who has subscribed to my Devlog, sent me emails, or otherwise expressed great interest in SixDegreeSteam. You all serve as my motivation, and I am very proud to be so highly regarded. Thank you all! Now, as promised, here is another SDS tech article.

From the conception of the project, SixDegreeSteam has changed fundamentally a couple of times. The reason is that there are so many factors to consider that it is near impossible for me alone to take them all into account. I've sat down and worked out the best approaches I can conceive, but after putting them into working code, something new shows itself and justifies a rethinking. The major issue is that there are two conflicting, equally and critically important aspects to this project: bandwidth and memory usage.

Bandwidth and memory are quasi-mutually exclusive. In other words, you cannot optimize one in a program that uses both without negatively impacting the other. The reason is simple. In SixDegreeSteam, with the exception of static and a very few dynamic data, anything we transfer (bandwidth usage) is useless to the crawler afterward and is therefore unnecessary to keep in memory (memory usage). Likewise, anything kept on the client side for a long time, indefinitely, or otherwise in a non-time-critical manner is not likely to be needed by the server, so memory on the client is used in favor of bandwidth. While this trade-off cuts both ways, bandwidth is the more precious commodity in the long run. Memory can be reclaimed by throwing away data, but bandwidth comes in limited amounts (for most of us) on a monthly basis. However, the more memory available to the program, the faster it can run and the larger the resulting dataset can be. I've had to rewrite the messaging protocol between the server and client once, the crawler module thrice, and the program base twice as a result of trying to find the best trade-off between the two. I have a sick feeling there will be a few more rewrites to come, and many more after the project goes public.

The bandwidth-versus-memory conundrum has led to a change in the structure of the program as a whole. In previous articles and discussions, I spoke of a multi-threaded program layout consisting of a server and a client (slave threader), with the client managing "slave threads" responsible for the actual crawling operations. As of this writing, multi-threading is still a part of the server aspect in order to allow the server to communicate with multiple clients at once. But, the client has been changed from the slave threader approach to a single-thread, single-crawler, tree-based data collection scheme. The client starts by requesting a Steam profile to begin crawling from the server. Then, an object at the top of a tree data structure is created to represent the profile. The object then crawls the ID, friends, and groups of the profile, creating leaf objects for each in the data structure as they are found. After the data structure reaches a set capacity, it's compressed and sent to the server. The client then removes the obsolete data and continues crawling profiles not yet completed. Two questions arise from this approach.
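To make the scheme concrete, here is a rough sketch of that single-crawler, flush-at-capacity loop. Everything here is illustrative only: the class names, the tiny hard-coded friend graph standing in for real Steam Community fetches, and the toy capacity of 4 are my own assumptions, not the actual SDS code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlSketch {
    static final int CAPACITY = 4; // toy value; the real batch would be far larger

    // Stub standing in for downloading a friends page from Steam Community.
    static List<String> fetchFriends(String id) {
        Map<String, List<String>> fake = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("D"));
        return fake.getOrDefault(id, List.of());
    }

    // Crawl outward from rootId, batching results; whenever the in-memory batch
    // hits CAPACITY, "flush" it (the real crawler would compress and send it to
    // the server here), discard it, and keep crawling.
    static int crawl(String rootId) {
        Deque<String> pending = new ArrayDeque<>(List.of(rootId));
        Set<String> seen = new HashSet<>(List.of(rootId));
        List<String> batch = new ArrayList<>();
        int flushes = 0;
        while (!pending.isEmpty()) {
            String id = pending.poll();
            batch.add(id);
            if (batch.size() >= CAPACITY) {
                flushes++;     // compress + send to server would happen here
                batch.clear(); // remove the obsolete data, continue crawling
            }
            for (String friend : fetchFriends(id)) {
                if (seen.add(friend)) pending.add(friend);
            }
        }
        return flushes;
    }

    public static void main(String[] args) {
        System.out.println(crawl("A")); // 4 profiles crawled at capacity 4: one flush
    }
}
```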

First, how do we make sure we aren't wasting memory, and a small amount of bandwidth, by crawling profiles the server already has on record? The simple answer: by spending an even smaller amount of bandwidth between the client and server to ask if the server needs the profile. When a profile is queued for crawling in the dataset, the client will send a message to the server containing the profile ID. The server will then respond with a message stating whether to continue crawling or remove the object from the queue. Herein lies the problem outlined before by our memory-bandwidth trade-off. On one hand, we could save bandwidth between the client and server by just crawling all objects put on the queue, at the expense of the memory the objects would consume. On the other, we could save that memory at the expense of the bandwidth needed to ask the server whether each profile we queue has already been crawled. The problem is made even more complex by the fact that any profile we remove from the dataset, after sending it to the server or after discovering it has already been crawled, will not be known to the client in the future. The possibility of the profile resurfacing is almost definite (which is exactly what this project sets out to measure!). So, question two arises.

Two, how do we stop the client from crawling profiles more than once? As the answer to question one pointed out, we have the option of simply asking the server whether we have crawled a profile before. But this will inevitably turn out to be a gigantic waste of bandwidth, because we will have to ask the server the same question more than once to receive the same answer. The alternative is to implement a low-impact cache, capable of quickly and compactly storing/recovering which profiles the server has told the client are already crawled. And, you guessed it, the trade-off is presented again: the cache increases memory usage to relieve bandwidth usage.
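A minimal sketch of such a cache, assuming a plain in-memory hash set (a real implementation might use something more compact, like a bit array or Bloom filter, to keep the memory cost down; the names here are hypothetical):

```java
import java.util.HashSet;
import java.util.Set;

class CrawledCache {
    private final Set<Long> crawled = new HashSet<>();

    // Record that the server reported this profile as already crawled.
    void markCrawled(long profileId) { crawled.add(profileId); }

    // true means we can skip the profile without asking the server again.
    boolean shouldSkip(long profileId) { return crawled.contains(profileId); }

    public static void main(String[] args) {
        CrawledCache cache = new CrawledCache();
        cache.markCrawled(76561197960265728L);
        System.out.println(cache.shouldSkip(76561197960265728L)); // no second round-trip
        System.out.println(cache.shouldSkip(76561197960265729L)); // unknown, must ask the server
    }
}
```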

This is a small glimpse into the current layout of the program. It most likely does not do justice to the project's complexity, but it should give you an idea of how the program will work and why it's taking me so long to complete.

Sunday, October 04, 2009

SixDegreeSteam Mailbag

Any updates on SixDegreeSteam? You said you would keep us posted, but it's been a while since I heard anything.

- Anonymous

Work and college stuff have been keeping me pretty preoccupied, so I haven't had time to do a whole lot with SDS lately. I'm currently working on a low-bandwidth method of transferring data between the clients (slave threaders) and the server. I also hope to post another article about the technical goodies behind the project, like what exactly a slave threader is. In the meantime, I've been considering multiple web hosting opportunities to run the SDS server on. I think I will also use the space to start work on the full-fledged BlakeNet (under a different name, as I said before). I still have no word from Valve about using Steamworks or the acceptability of the bandwidth costs on their end. I'll have to shoot some more emails later.

Could I convince you to make SDS open-source?

- Anonymous

I have never been one to withhold source code for personal projects from anyone interested in reading through it. However, I will have to keep it under wraps until development has ceased and I feel confident in its efficiency. Shortly after the project goes mainstream, the source code will be readily available for all to enjoy.

This article is a reply to an email. If you would like to ask a question or suggest a new article, email me. (remove the dot)

Saturday, September 12, 2009

SixDegreeSteam Official Group Open

The official SixDegreeSteam group (SixDegree$team) is now open for all interested Steam users to join. While project news and developmental goodies will be pushed here first, the SDS Steam group will be updated with important project updates only. Joining is a good way to get involved and show your support if you use Steam. Being a member of it will also grant you bragging rights when/if the project becomes mainstream, because members of the group will be the first to be crawled when we launch!

Thursday, September 10, 2009

SixDegreeSteam: Challenges

There are a few caveats to consider when developing this project. I have laid the information out in a (hopefully) easy-to-follow situation-caveat format. An important thing to remember is that not all caveats have an inherent solution, and the best solution is not always the most favorable.

Situation: First and foremost is the way Steam Community presents the information we hope to use. Since all information will be acquired by following "tips" from one page to a handful of others, and continuing this pattern until all pages are analyzed, we call the process crawling. It works much like a crawler for a search engine: where a search-engine crawler follows links on one page to find others, our crawler reads IDs from one page and creates new links pointing to subsequent pages. Crawling friend information is easy, as the entire list of friends for a profile is in an HTML file. All we have to do is download the file from the site, extract all friend IDs, and queue them for crawling. Similarly, there is an XML file for each user containing profile information, including the groups the user is a member of. As for crawling groups, a set of XML files for each group lists all member IDs (one XML file per 1,000 members). Note that the XML file we use to get a user's group memberships is not ideal, since it contains much more information than we need. This means that download speed and bandwidth usage (on both the crawling server and the Steam Community network) are impacted unnecessarily. A solution would be the Steamworks API Valve offers to developers, which can access community information directly without crawling community pages, reducing bandwidth usage and overhead. However, Valve has declined to grant me usage rights to Steamworks.
Caveat: Steam users have the ability to set privacy options that affect who is allowed to view their profiles. The current privacy options are private, public, and friends-only. Our crawler will only be able to use public profiles. Something to note, though, is that groups are always public, and therefore all of their members are visible. This means we can at least gather group membership information for non-public profiles. Likewise, if a private user is a friend of a public user, we will still be able to link the two in our dataset, since friendship on Steam is mutual (user A cannot have user B on their friend list without user B also having user A on theirs).
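For illustration, the "extract all friend IDs" step could look roughly like this. The URL pattern and the 17-digit 64-bit ID format are my assumptions about the page markup, not a guarantee of how Steam Community actually formats its links:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FriendExtractor {
    // Assumed link format; the real pages may differ.
    static final Pattern FRIEND_LINK =
        Pattern.compile("steamcommunity\\.com/profiles/(\\d{17})");

    // Scan a downloaded friends-list page and collect every profile ID found.
    static List<String> extractIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = FRIEND_LINK.matcher(html);
        while (m.find()) ids.add(m.group(1));
        return ids;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://steamcommunity.com/profiles/76561197960265728\">..."
                    + "<a href=\"http://steamcommunity.com/profiles/76561197960265729\">...";
        System.out.println(extractIds(html));
    }
}
```

Each ID extracted this way would then be queued for its own crawl, exactly as described above.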

Situation: As was previously stated, the project is very web-intensive, utilizing a crawler technique to gather our data. This is a massive undertaking, requiring a huge number of page/file downloads and subsequent parsing. While it is impossible to give an exact number to represent the size of our resulting dataset, the number of files we will need to download can be represented by a simple formula. We need at least one page per user (the profile information XML file, which also contains group membership information). This page is used to tell if the profile is public by checking whether we get an access error, data is missing, or the privacy setting is explicitly shown. If it's private, it would be wasteful to download the next file, as we don't have permission to view it and will just get a similar error to the first file. If it's publicly viewable, we download a second file (friend list HTML) to complete our data collection for that user. Groups are a lot simpler, as they are always public and will always have at least one member, therefore at least one file (the member list XML). Our formula for how many files we will need to download is A + B + (C * ceil(D/1,000)), where A is the number of Steam users, B is the number of users with public profiles, C is the number of groups, and D is the average number of members per group. As of this writing, there are over 53,300,212 Steam users and 1,101,806 groups. Guessing that at least three quarters of all users have public profiles and every group has 2,000 members (in reality, some users are only in one group where others are in one hundred, and some groups have one member where others have one million. 2,000 members per group is the best on-the-spot average estimate I can come up with for now), our file count is 53,300,212 + (53,300,212*(3/4)) + (1,101,806 * ceil(2,000/1,000)) = 53,300,212 + 39,975,159 + 2,203,612 = 95,478,983.
Caveat: To put that into data consumption terms, a member list for a group with 1,000 members is about 42 KB, a friends list containing 50 friends is about 40 KB, and a complete profile page (used to get group memberships) displaying 10 group memberships is about 14 KB. It's very rare that a single file will be greater than 42 KB. Also, the user profile XML is what we use to tell if a profile is private or not, so it's important to point out that the file returned for a private profile is about 366 B. So, modifying our previous formula to calculate bandwidth usage, we get (53,300,212 * 0.36) + (39,975,159 * (40 + (14-0.36))) + (2,203,612 * 42) = 2,256,007,309 KB = 2.10 TB. That is a serious amount of data to be transferring! We won't keep all of it, but as pointed out in Caveat 1, this is how we have to collect the data for lack of an alternative.
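The two estimates above can be checked with a few lines of code. This just re-evaluates the article's own formulas; the constants (75% public, 2,000 members per group, the per-file sizes) are the same guesses as in the text:

```java
class CrawlEstimator {
    public static void main(String[] args) {
        long users = 53_300_212L;                 // A: all Steam accounts
        long publicUsers = users * 3 / 4;         // B: guessed 75% are public
        long groups = 1_101_806L;                 // C: all groups
        long pagesPerGroup = (2000 + 999) / 1000; // ceil(D/1000) with D = 2,000

        // File count: A + B + (C * ceil(D/1000))
        long files = users + publicUsers + groups * pagesPerGroup;
        System.out.println(files);

        // Bandwidth in KB: profile XMLs + public users' extra files + group member lists
        double kb = users * 0.36
                  + publicUsers * (40 + (14 - 0.36))
                  + groups * pagesPerGroup * 42;
        System.out.printf("%.0f%n", kb);
    }
}
```

Running it prints 95478983 files and 2256007309 KB, matching the figures above.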

Situation: Caveat 2 points out that there is a lot of data transfer involved, so bandwidth usage is definitely an issue. Naturally, processing time for the data is also a factor. We can't provide accurate estimates for how long parsing each of the three file types would take, because some computers are faster than others, processing time changes as memory becomes available or runs out, etcetera. In either case, the obvious solution to speeding up both crawling and processing, as well as spreading bandwidth usage, is to use more than one computer. To speed processing up even further, we can create more than one instance of the crawling routine and have each instance work on a different page concurrently. This means we can process ten or more pages in the time it would take a single instance to process one. I will discuss both of these in depth in a later article.
Caveat: Creating a control structure that can manage multiple instances (called threads) is one thing. Creating a server program that can remotely manage multiple programs, each running multiple instances, is a completely different thing. The architecture of the server and its complementing "slave threaders" will have to be such that each slave threader is capable of working, to an extent, on its own. The only time a slave threader should call home to the server is when the global scope of the dataset needs to be analyzed or all of its threads are idle (more work is needed). We adopt this minimalistic communication policy to make sure the bandwidth used between the two components does not become a factor in itself. This is a very intricate topic that requires extra thought on its own. As such, a future article will be devoted to this topic alone.
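As a tiny illustration of the "multiple instances" half of this, Java's built-in thread pools already handle the local instance management; the hard part described above is the remote server coordination, which this sketch deliberately leaves out. The task bodies here are placeholders standing in for downloading and parsing a community page:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SlaveThreaderSketch {
    public static void main(String[] args) throws Exception {
        // Four worker threads standing in for four concurrent crawler instances
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> pages = List.of("page1", "page2", "page3", "page4", "page5");

        // Submit one task per page; each runs concurrently with the others
        List<Future<String>> results = new ArrayList<>();
        for (String page : pages) {
            results.add(pool.submit(() -> "parsed:" + page));
        }

        // Collect results in submission order
        for (Future<String> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```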

These are only a handful of the issues we need to consider when tackling this project. They represent the major challenges I am facing right now in the design of the program. In an upcoming article, Caveat 3 will be examined more closely to show how I intend to structure the server-slave threader relationship and my implementation of thread pooling.

Edited: This post was updated for correctness on January 14, 2010.

Wednesday, September 09, 2009

SixDegreeSteam: The Intro

As my previous article hinted at, I am planning to write a program that expresses the link between every Steam user. As of this article's writing, there are exactly 53,300,212 accounts registered with Steam. This includes active, inactive, banned, duplicate, publisher/staff, and Internet Cafe ("multiseat") accounts. It is impossible to distinguish just how many of those accounts are actually worth the effort of this project without walking through each account profile (ironically, that's just what this project sets out to do). Each of those accounts is allowed to have a maximum of 255 friends and an unknown (perhaps unlimited?) number of group memberships. Now that we have some numbers down, let's talk about the actual program.

Introducing SixDegreeSteam, an attempt at mapping the Steam social network. The name is a reference to the Six Degrees of Separation sociological theory that the project is based on, which states that every person in the world is linked to every other person by no more than six people. My theory is that a similar phenomenon exists in all social media, specifically in the Steam platform's underlying network, Steam Community. However, there is a fundamental difference between Steam and other social networking media: some people only use Steam to game, not to make friends! This means some accounts will have 100 friends and be members of 10 different groups, while others will have 0 friends and 0 groups. The network is obscured even further by people who do have friends and group memberships, but only with people they know well. This makes the web of connections incredibly shallow and, in severe cases, exclusive, with no connections to the mainstream web or to other exclusive webs. In other words, "everybody knows everybody" only works if "everybody" is not limited to a select few.

There are also a couple technical caveats we must address. They will be discussed in an upcoming article.

Monday, September 07, 2009

Six Degrees of Inactivity

Where did you go??? It's been months since your last post! Are you working on anything interesting?

- Anonymous

First off, sorry for the inactivity, guys. I've been insanely busy lately with a lot of stuff, mostly rectifying a couple of situations in my personal life. Now, between college work and my new job, I don't have much time to myself. What little time I do have, I use to play games with friends or work on little side projects. Because of this, I wouldn't expect to hear too much from me for a while. Feel free to email me to say hi or catch up if you'd like! I check my email daily when I get home from work.

Having said that, I do have a major project I'm in the middle of conceptualizing. Circumstances providing, I'll make an in-depth post about it later. For now, the singular concept will have to suffice. Have you ever heard of the Small World Phenomenon, or more to the point, the Six Degrees of Separation model? They are very interesting ideas that attempt to link every person in the world. The idea was built on to create some popular games like Six Degrees of Kevin Bacon, Six Degrees of Wikipedia, and even a Facebook application. Surprisingly, there is no implementation of this idea for the wildly popular Steam platform. Well, I think it's about time someone puts an end to that, and who is more qualified to do so than me? :)

The goal of the project is to create a mapping showing how every Steam user is linked to everyone else. The project will take into account friends, as well as group memberships. This is a very exciting project for me and a major undertaking. More information to come when/if I get the time!

This article is a reply to an email. If you would like to ask a question or suggest a new article, email me. (remove the dot)

Saturday, May 30, 2009

Multiplayer Ports

Your multiplayer examples use port 8087. Is that required, or can I change that? Is there any importance to it?

- Anonymous

If you recall my multiplayer design series, the example code I published used port 8087. There is no real significance to this port. I just thought it had a nice ring to it. :)

In general, you can change the port to whatever you want. However, you should avoid ports that are used for other services; if the port is already in use, errors will occur when your program tries to bind to it. The ports used by system services and common programs should never be used. These are the well-known ports, in the range of 0-1023. To find a port that is unregistered or used only by an exotic program, I use this Wikipedia article. As you can see, port 8087 is actually registered to some other programs, but they are not very common.
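One practical way to find out whether a port is already taken on your machine is to simply try binding to it and see if that fails. Here is a quick sketch of that check (the demo binds an ephemeral port itself first, so the printed result is deterministic):

```java
import java.io.IOException;
import java.net.ServerSocket;

class PortCheck {
    // Try to bind to the port; if the bind fails, something else is using it
    // (or the OS restricts it, e.g. well-known ports without privileges).
    static boolean isFree(int port) {
        try (ServerSocket s = new ServerSocket(port)) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Grab any free port ourselves, then confirm isFree() reports it taken.
        try (ServerSocket held = new ServerSocket(0)) {
            System.out.println(isFree(held.getLocalPort()));
        }
    }
}
```

For a real server you would check your chosen port (say, 8087) the same way, or better, just attempt the bind and handle the exception.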

This article is a reply to an email. If you would like to ask a question or suggest a new article, email me. (remove the dot)

Saturday, April 11, 2009

Carriage Returns

Recently, I stumbled upon (literally) an article written by Chad W. L. Whitacre discussing a use for the infamously useless carriage return. Any programmer who has worked with direct input to/from file buffers has probably encountered a character that accompanies the newline character on some systems. The usual sequence Windows uses to terminate a line and move to the beginning of the next one is \r\n, where \r writes a carriage return (moving the cursor back to the start of the line) and \n writes a line feed (moving down to the next line). Note that this is a convention of the operating system and its programs, not of the file system like NTFS; Unix-like systems get by with \n alone.

If you stop to think about it, the carriage return is useful on its own. It doesn't erase anything; it simply moves the cursor back to the beginning of the current line, so whatever is printed next overwrites what was already there. Chad found that the carriage return works this way on a live console buffer, not just in files! In his article, he talks about using this technique in a TTY shell to make an updating progress bar. I thought I would take it a step further and see if it was implementable in a Java command line program. Sure enough, it works! The following is a basic program that uses a progress bar that updates as the program runs:


class progressbar {
    public static void main(String[] args) {
        System.out.println("Running a really long loop...");

        // How many times the loop should run
        int runcount = 500;

        // Prepare the progress meter
        System.out.print("[0/" + runcount + "]");

        for (int i = 0; i < runcount; i++) {
            // Code that takes longer than a millisecond to execute goes here

            // Return to the start of the line and overwrite the old meter
            System.out.print("\r[" + (i+1) + "/" + runcount + "]");
        }

        // Move off the progress line once we finish
        System.out.println();
    }
}

Now no one has an excuse to not have progress displays in their programs! :)

Monday, March 09, 2009

TF2 Oopsies

Lately, I've been doing SourceMod plugin development for everything from infinite ammo/uber to new gameplay modes for use in BZG and other TF2 servers. This morning I was testing my newest gameplay plugin, called Hot Potato. It was the first really vigorous test I had run with the plugin, and it failed dramatically in flames of game crashes. I managed to capture a (laggy) piece of the carnage and post it on YouTube. Check it out. As the video description states, the plugin crashed at least four people, spawned 200+ entities within 10 seconds, and made 2,244 function calls within 60 seconds.

I guess this serves as proof that pros are still prone to screw up. :p

Monday, February 16, 2009

The Brazen Guard

The Brazen Guard (BZG) is this awesome online community I have been a member of since May 25, 2008. Currently, we operate two Team Fortress 2 servers and one Gmod server. We play strictly noncompetitive, although we do have a small group dedicated to clanwars and scrimmages.

Around January 24, 2009, I was given RCON administrative access and a place in the leadership of the clan. Since then, BZG has undergone some great changes thanks to a revitalization of the community. Something we do regularly is hold these things we call Event Nights. Every Friday and Saturday, we lock one of our TF2 servers and only allow BZG members in. Then, we play the game using our own little minigame ideas. It's absolutely, insanely fun. Everyone has a great time and the brotherly bond we all share becomes stronger.

The whole purpose of this post is to inform everyone about BZG's own Youtube channel, The BZG Authority, where I will be posting video highlights of our scrimmages, Event Nights, and other madness. You can find more information about the servers and the clan at We are by no means an exclusive community, but if you would like to join, you have to play with us on our servers a couple of times so we can be sure you're serious about becoming part of the family.

Anyone who has TF2 or Gmod, come on and play with us! Otherwise, enjoy the videos and be sure to rate/comment!

Wednesday, January 07, 2009

We're Back!

BlakeNet is back online. This year marks the chance for a lot of new stuff on BlakeNet. The biggest news is that I plan to start seeking sponsorship and analyzing development costs to upgrade the server, get some video equipment, and turn BlakeNet into the comedy video site it was envisioned to be. Don't expect to see these changes happen overnight, but if all goes well, we might have an awesome new project on our hands! :)