Blake's Coding Blog: MySQL Date Comparison Performance (Blake, 2014-10-01)
<p/>
As the webmaster of a major IT solutions provider, one of my responsibilities involves optimizing queries against our MySQL database in order to ensure fast page loads. Like most programming languages and technologies, there are often multiple ways to perform a certain action or get a certain result, but only a few of those possible approaches are efficient. MySQL is no exception. Hundreds of articles exist that compare the performance of different query strategies. However, not all scenarios have been exhausted yet. Today, I will be examining one of those scenarios: range comparisons against columns of the <i>DATE</i> datatype in InnoDB.
<p/>
The slow query that posed the original question for me is a compound query responsible for gathering traffic counter statistics for a web form we have. Each time a visitor to our site interacts with the form by viewing, beginning, or submitting it, we update a counter column in our traffic table on the row that contains the current date and interaction condition. Later, an administrator can access these statistics via a page that executes a query that accumulates the counters for each condition in a couple different date ranges: today, yesterday, this week, last week, this month, and this year. The table structure and the broken-out queries look like this:
<p/>
<blockquote style='overflow-x:auto;'><pre style='width:130em;'><span style='color:#800000; font-weight:bold;'>CREATE TABLE</span> <span style='color:#800000;'>`</span><span style='color:#000000;'>landing_submission_new_traffic</span><span style='color:#800000;'>`</span> <span style='color:#808030;'>(</span>
<span style='color:#800000;'>`</span><span style='color:#000000;'>traffic_condition</span><span style='color:#800000;'>`</span> <span style='color:#bb7977;'>enum</span><span style='color:#808030;'>(</span><span style='color:#0000e6;'>'viewed'</span><span style='color:#800080;'>,</span><span style='color:#0000e6;'>'started'</span><span style='color:#800080;'>,</span><span style='color:#0000e6;'>'finished'</span><span style='color:#808030;'>)</span> <span style='color:#808030;'>NOT NULL</span><span style='color:#800080;'>,</span>
<span style='color:#800000;'>`</span><span style='color:#000000;'>traffic_date</span><span style='color:#800000;'>`</span> <span style='color:#bb7977;'>date</span> <span style='color:#808030;'>NOT NULL</span><span style='color:#800080;'>,</span>
<span style='color:#800000;'>`</span><span style='color:#000000;'>traffic_count</span><span style='color:#800000;'>`</span> <span style='color:#bb7977;'>smallint</span><span style='color:#808030;'>(</span><span style='color:#008c00;'>5</span><span style='color:#808030;'>)</span> <span style='color:#074726;'>unsigned</span> <span style='color:#808030;'>NOT NULL</span> <span style='color:#074726;'>default</span> <span style='color:#0000e6;'>'0'</span><span style='color:#800080;'>,</span>
<span style='color:#800000; font-weight:bold;'>PRIMARY KEY</span> <span style='color:#808030;'>(</span><span style='color:#800000;'>`</span><span style='color:#000000;'>traffic_condition</span><span style='color:#800000;'>`</span><span style='color:#800080;'>,</span><span style='color:#800000;'>`</span><span style='color:#000000;'>traffic_date</span><span style='color:#800000;'>`</span><span style='color:#808030;'>)</span>
<span style='color:#808030;'>)</span> ENGINE<span style='color:#808030;'>=</span><span style='color:#074726;'>InnoDB</span><span style='color:#800080;'>;</span>
<span style='color:#696969;'># Today's traffic</span>
<span style='color:#800000; font-weight:bold;'>SELECT DISTINCT</span> traffic_condition<span style='color:#800080;'>,</span>
<span style='color:#400000;'>IFNULL</span><span style='color:#808030;'>(</span><span style='color:#808030;'>(</span><span style='color:#800000; font-weight:bold;'>SELECT</span> <span style='color:#400000;'>SUM</span><span style='color:#808030;'>(</span>traffic_count<span style='color:#808030;'>)</span> <span style='color:#800000; font-weight:bold;'>FROM</span> landing_submission_new_traffic <span style='color:#800000; font-weight:bold;'>WHERE</span> traffic_condition <span style='color:#808030;'>=</span> lsnt1<span style='color:#808030;'>.</span>traffic_condition <span style='color:#808030;'>AND</span> traffic_date <span style='color:#808030;'>=</span> <span style='color:#400000;'>CURRENT_DATE</span><span style='color:#808030;'>)</span><span style='color:#800080;'>,</span> <span style='color:#008c00;'>0</span><span style='color:#808030;'>)</span> <span style='color:#800000; font-weight:bold;'>AS</span> today_sum
<span style='color:#800000; font-weight:bold;'>FROM</span> landing_submission_new_traffic lsnt1 <span style='color:#800000; font-weight:bold;'>ORDER</span> <span style='color:#800000; font-weight:bold;'>BY</span> traffic_condition<span style='color:#808030;'>+</span><span style='color:#008c00;'>0</span><span style='color:#800080;'>;</span>
<span style='color:#696969;'># Yesterday's traffic</span>
<span style='color:#800000; font-weight:bold;'>SELECT DISTINCT</span> traffic_condition<span style='color:#800080;'>,</span>
<span style='color:#400000;'>IFNULL</span><span style='color:#808030;'>(</span><span style='color:#808030;'>(</span><span style='color:#800000; font-weight:bold;'>SELECT</span> <span style='color:#400000;'>SUM</span><span style='color:#808030;'>(</span>traffic_count<span style='color:#808030;'>)</span> <span style='color:#800000; font-weight:bold;'>FROM</span> landing_submission_new_traffic <span style='color:#800000; font-weight:bold;'>WHERE</span> traffic_condition <span style='color:#808030;'>=</span> lsnt1<span style='color:#808030;'>.</span>traffic_condition <span style='color:#808030;'>AND</span> traffic_date <span style='color:#808030;'>=</span> <span style='color:#400000;'>CURRENT_DATE</span> <span style='color:#808030;'>-</span> <span style='color:#400000;'>INTERVAL</span> <span style='color:#008c00;'>1</span> <span style='color:#800000; font-weight:bold;'>DAY</span><span style='color:#808030;'>)</span><span style='color:#800080;'>,</span> <span style='color:#008c00;'>0</span><span style='color:#808030;'>)</span> <span style='color:#800000; font-weight:bold;'>AS</span> yesterday_sum
<span style='color:#800000; font-weight:bold;'>FROM</span> landing_submission_new_traffic lsnt1 <span style='color:#800000; font-weight:bold;'>ORDER</span> <span style='color:#800000; font-weight:bold;'>BY</span> traffic_condition<span style='color:#808030;'>+</span><span style='color:#008c00;'>0</span><span style='color:#800080;'>;</span>
<span style='color:#696969;'># This week's traffic</span>
<span style='color:#800000; font-weight:bold;'>SELECT DISTINCT</span> traffic_condition<span style='color:#800080;'>,</span>
<span style='color:#400000;'>IFNULL</span><span style='color:#808030;'>(</span><span style='color:#808030;'>(</span><span style='color:#800000; font-weight:bold;'>SELECT</span> <span style='color:#400000;'>SUM</span><span style='color:#808030;'>(</span>traffic_count<span style='color:#808030;'>)</span> <span style='color:#800000; font-weight:bold;'>FROM</span> landing_submission_new_traffic <span style='color:#800000; font-weight:bold;'>WHERE</span> traffic_condition <span style='color:#808030;'>=</span> lsnt1<span style='color:#808030;'>.</span>traffic_condition <span style='color:#808030;'>AND</span> <span style='color:#400000;'>YEARWEEK</span><span style='color:#808030;'>(</span>traffic_date<span style='color:#808030;'>)</span> <span style='color:#808030;'>=</span> <span style='color:#400000;'>YEARWEEK</span><span style='color:#808030;'>(</span><span style='color:#400000;'>CURRENT_DATE</span><span style='color:#808030;'>)</span><span style='color:#808030;'>)</span><span style='color:#800080;'>,</span> <span style='color:#008c00;'>0</span><span style='color:#808030;'>)</span> <span style='color:#800000; font-weight:bold;'>AS</span> week_sum
<span style='color:#800000; font-weight:bold;'>FROM</span> landing_submission_new_traffic lsnt1 <span style='color:#800000; font-weight:bold;'>ORDER</span> <span style='color:#800000; font-weight:bold;'>BY</span> traffic_condition<span style='color:#808030;'>+</span><span style='color:#008c00;'>0</span><span style='color:#800080;'>;</span>
<span style='color:#696969;'># Last week's traffic</span>
<span style='color:#800000; font-weight:bold;'>SELECT DISTINCT</span> traffic_condition<span style='color:#800080;'>,</span>
<span style='color:#400000;'>IFNULL</span><span style='color:#808030;'>(</span><span style='color:#808030;'>(</span><span style='color:#800000; font-weight:bold;'>SELECT</span> <span style='color:#400000;'>SUM</span><span style='color:#808030;'>(</span>traffic_count<span style='color:#808030;'>)</span> <span style='color:#800000; font-weight:bold;'>FROM</span> landing_submission_new_traffic <span style='color:#800000; font-weight:bold;'>WHERE</span> traffic_condition <span style='color:#808030;'>=</span> lsnt1<span style='color:#808030;'>.</span>traffic_condition <span style='color:#808030;'>AND</span> <span style='color:#400000;'>YEARWEEK</span><span style='color:#808030;'>(</span>traffic_date<span style='color:#808030;'>)</span> <span style='color:#808030;'>=</span> <span style='color:#400000;'>YEARWEEK</span><span style='color:#808030;'>(</span><span style='color:#400000;'>CURRENT_DATE</span> <span style='color:#808030;'>-</span> <span style='color:#400000;'>INTERVAL</span> <span style='color:#008c00;'>1</span> <span style='color:#400000;'>WEEK</span><span style='color:#808030;'>)</span><span style='color:#808030;'>)</span><span style='color:#800080;'>,</span> <span style='color:#008c00;'>0</span><span style='color:#808030;'>)</span> <span style='color:#800000; font-weight:bold;'>AS</span> lastweek_sum
<span style='color:#800000; font-weight:bold;'>FROM</span> landing_submission_new_traffic lsnt1 <span style='color:#800000; font-weight:bold;'>ORDER</span> <span style='color:#800000; font-weight:bold;'>BY</span> traffic_condition<span style='color:#808030;'>+</span><span style='color:#008c00;'>0</span><span style='color:#800080;'>;</span>
<span style='color:#696969;'># This month's traffic</span>
<span style='color:#800000; font-weight:bold;'>SELECT DISTINCT</span> traffic_condition<span style='color:#800080;'>,</span>
<span style='color:#400000;'>IFNULL</span><span style='color:#808030;'>(</span><span style='color:#808030;'>(</span><span style='color:#800000; font-weight:bold;'>SELECT</span> <span style='color:#400000;'>SUM</span><span style='color:#808030;'>(</span>traffic_count<span style='color:#808030;'>)</span> <span style='color:#800000; font-weight:bold;'>FROM</span> landing_submission_new_traffic <span style='color:#800000; font-weight:bold;'>WHERE</span> traffic_condition <span style='color:#808030;'>=</span> lsnt1<span style='color:#808030;'>.</span>traffic_condition <span style='color:#808030;'>AND</span> <span style='color:#bb7977;'>YEAR</span><span style='color:#808030;'>(</span>traffic_date<span style='color:#808030;'>)</span> <span style='color:#808030;'>=</span> <span style='color:#bb7977;'>YEAR</span><span style='color:#808030;'>(</span><span style='color:#400000;'>CURRENT_DATE</span><span style='color:#808030;'>)</span> <span style='color:#808030;'>AND</span> <span style='color:#800000; font-weight:bold;'>MONTH</span><span style='color:#808030;'>(</span>traffic_date<span style='color:#808030;'>)</span> <span style='color:#808030;'>=</span> <span style='color:#800000; font-weight:bold;'>MONTH</span><span style='color:#808030;'>(</span><span style='color:#400000;'>CURRENT_DATE</span><span style='color:#808030;'>)</span><span style='color:#808030;'>)</span><span style='color:#800080;'>,</span> <span style='color:#008c00;'>0</span><span style='color:#808030;'>)</span> <span style='color:#800000; font-weight:bold;'>AS</span> month_sum
<span style='color:#800000; font-weight:bold;'>FROM</span> landing_submission_new_traffic lsnt1 <span style='color:#800000; font-weight:bold;'>ORDER</span> <span style='color:#800000; font-weight:bold;'>BY</span> traffic_condition<span style='color:#808030;'>+</span><span style='color:#008c00;'>0</span><span style='color:#800080;'>;</span>
<span style='color:#696969;'># This year's traffic</span>
<span style='color:#800000; font-weight:bold;'>SELECT DISTINCT</span> traffic_condition<span style='color:#800080;'>,</span>
<span style='color:#400000;'>IFNULL</span><span style='color:#808030;'>(</span><span style='color:#808030;'>(</span><span style='color:#800000; font-weight:bold;'>SELECT</span> <span style='color:#400000;'>SUM</span><span style='color:#808030;'>(</span>traffic_count<span style='color:#808030;'>)</span> <span style='color:#800000; font-weight:bold;'>FROM</span> landing_submission_new_traffic <span style='color:#800000; font-weight:bold;'>WHERE</span> traffic_condition <span style='color:#808030;'>=</span> lsnt1<span style='color:#808030;'>.</span>traffic_condition <span style='color:#808030;'>AND</span> <span style='color:#bb7977;'>YEAR</span><span style='color:#808030;'>(</span>traffic_date<span style='color:#808030;'>)</span> <span style='color:#808030;'>=</span> <span style='color:#bb7977;'>YEAR</span><span style='color:#808030;'>(</span><span style='color:#400000;'>CURRENT_DATE</span><span style='color:#808030;'>)</span><span style='color:#808030;'>)</span><span style='color:#800080;'>,</span> <span style='color:#008c00;'>0</span><span style='color:#808030;'>)</span> <span style='color:#800000; font-weight:bold;'>AS</span> year_sum
<span style='color:#800000; font-weight:bold;'>FROM</span> landing_submission_new_traffic lsnt1 <span style='color:#800000; font-weight:bold;'>ORDER</span> <span style='color:#800000; font-weight:bold;'>BY</span> traffic_condition<span style='color:#808030;'>+</span><span style='color:#008c00;'>0</span><span style='color:#800080;'>;</span></pre></blockquote>
<p/>
At the time of this writing, the table has 1,778 unique rows (401 unique dates) with the compound query taking about 4-7 seconds to execute with caching disabled on a high-load shared database. Obviously, a high-load shared database isn't the best candidate for running time-critical queries, and the primary query could be rewritten to perform better (consider joins and unions), but let's focus on the hypothesis that the subqueries are less than optimal because of the ways they compare dates. Checking each query individually shows that the first two queries, today's and yesterday's traffic, run in less than <sup>1</sup>/<sub>10</sub> secs each while the other queries run in about 1.5-2 secs each. This makes sense because the first two queries run a simple equality check against the date. So, what's making the others so slow? They all use MySQL internal functions and minimal arithmetic, but could that really be the cause? And if so, what should we use instead?
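As a point of reference for the "joins and unions" remark above, the six correlated subqueries could be collapsed into a single pass over the table with conditional aggregation. This is only a sketch of that rewrite, not the production query, and note that it still applies <i>YEARWEEK()</i>, <i>YEAR()</i>, and <i>MONTH()</i> to the column, so it inherits the same date-comparison costs examined in the rest of this post:

```sql
-- One scan of the table: each SUM() counts only the rows whose date
-- falls inside the matching range for that output column.
SELECT traffic_condition,
       SUM(IF(traffic_date = CURRENT_DATE, traffic_count, 0)) AS today_sum,
       SUM(IF(traffic_date = CURRENT_DATE - INTERVAL 1 DAY, traffic_count, 0)) AS yesterday_sum,
       SUM(IF(YEARWEEK(traffic_date) = YEARWEEK(CURRENT_DATE), traffic_count, 0)) AS week_sum,
       SUM(IF(YEARWEEK(traffic_date) = YEARWEEK(CURRENT_DATE - INTERVAL 1 WEEK), traffic_count, 0)) AS lastweek_sum,
       SUM(IF(YEAR(traffic_date) = YEAR(CURRENT_DATE) AND MONTH(traffic_date) = MONTH(CURRENT_DATE), traffic_count, 0)) AS month_sum,
       SUM(IF(YEAR(traffic_date) = YEAR(CURRENT_DATE), traffic_count, 0)) AS year_sum
FROM landing_submission_new_traffic
GROUP BY traffic_condition
ORDER BY traffic_condition+0;
```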
<p/>
Let's see how different forms of year, month, and week queries perform. First, we need a sterile test environment where we can get consistent execution times without the added noise of a live production system. I chose a low-load VPS where the database isn't being accessed by anything else and created a table containing only the dates and row IDs (the row IDs are there for index hit testing, which will make sense in a bit). I also exploded the dates out into year, month, and day-of-month columns for an integer comparison test case. I then populated the table with roughly 1,000,000 rows covering 397 dates; the exact quantity of dates is unimportant. The dates were generated by repeatedly inserting a backup of the production table referenced earlier until the table held at least 1 million rows. The table structure follows:
<p/>
<blockquote style='overflow-x:auto;'><pre style='width:48em;'><span style='color:#800000; font-weight:bold;'>CREATE TABLE</span> <span style='color:#800000;'>`</span><span style='color:#000000;'>test_dateindex</span><span style='color:#800000;'>`</span> <span style='color:#808030;'>(</span>
<span style='color:#800000;'>`</span><span style='color:#000000;'>test_id</span><span style='color:#800000;'>`</span> <span style='color:#bb7977;'>int</span><span style='color:#808030;'>(</span><span style='color:#008c00;'>10</span><span style='color:#808030;'>)</span> <span style='color:#074726;'>unsigned</span> <span style='color:#808030;'>NOT NULL</span> <span style='color:#074726;'>AUTO_INCREMENT</span><span style='color:#800080;'>,</span>
<span style='color:#800000;'>`</span><span style='color:#000000;'>test_date</span><span style='color:#800000;'>`</span> <span style='color:#bb7977;'>date</span> <span style='color:#808030;'>NOT NULL</span><span style='color:#800080;'>,</span>
<span style='color:#800000;'>`</span><span style='color:#000000;'>test_date_year</span><span style='color:#800000;'>`</span> <span style='color:#bb7977;'>year</span><span style='color:#808030;'>(</span><span style='color:#008c00;'>4</span><span style='color:#808030;'>)</span> <span style='color:#808030;'>NOT NULL</span><span style='color:#800080;'>,</span>
<span style='color:#800000;'>`</span><span style='color:#000000;'>test_date_month</span><span style='color:#800000;'>`</span> <span style='color:#bb7977;'>tinyint</span><span style='color:#808030;'>(</span><span style='color:#008c00;'>2</span><span style='color:#808030;'>)</span> <span style='color:#074726;'>unsigned</span> <span style='color:#808030;'>NOT NULL</span><span style='color:#800080;'>,</span>
<span style='color:#800000;'>`</span><span style='color:#000000;'>test_date_day</span><span style='color:#800000;'>`</span> <span style='color:#bb7977;'>tinyint</span><span style='color:#808030;'>(</span><span style='color:#008c00;'>2</span><span style='color:#808030;'>)</span> <span style='color:#074726;'>unsigned</span> <span style='color:#808030;'>NOT NULL</span><span style='color:#800080;'>,</span>
<span style='color:#800000; font-weight:bold;'>PRIMARY KEY</span> <span style='color:#808030;'>(</span><span style='color:#800000;'>`</span><span style='color:#000000;'>test_id</span><span style='color:#800000;'>`</span><span style='color:#808030;'>)</span><span style='color:#800080;'>,</span>
<span style='color:#800000; font-weight:bold;'>KEY</span> <span style='color:#800000;'>`</span><span style='color:#000000;'>test_date</span><span style='color:#800000;'>`</span> <span style='color:#808030;'>(</span><span style='color:#800000;'>`</span><span style='color:#000000;'>test_date</span><span style='color:#800000;'>`</span><span style='color:#808030;'>)</span><span style='color:#800080;'>,</span>
<span style='color:#800000; font-weight:bold;'>KEY</span> <span style='color:#800000;'>`</span><span style='color:#000000;'>test_date_unfolded</span><span style='color:#800000;'>`</span> <span style='color:#808030;'>(</span><span style='color:#800000;'>`</span><span style='color:#000000;'>test_date_year</span><span style='color:#800000;'>`</span><span style='color:#800080;'>,</span><span style='color:#800000;'>`</span><span style='color:#000000;'>test_date_month</span><span style='color:#800000;'>`</span><span style='color:#800080;'>,</span><span style='color:#800000;'>`</span><span style='color:#000000;'>test_date_day</span><span style='color:#800000;'>`</span><span style='color:#808030;'>)</span>
<span style='color:#808030;'>)</span> ENGINE<span style='color:#808030;'>=</span><span style='color:#074726;'>InnoDB</span><span style='color:#800080;'>;</span></pre></blockquote>
<p/>
Before we get to the testing, let's review some important characteristics of MySQL and the InnoDB engine...
<p/>
<ul>
<li><b>Test cases for one engine don't necessarily translate to another.</b> In this test case specifically, the index layout wouldn't make much sense for a MyISAM table, but lends itself to our benchmarking in InnoDB. This is because InnoDB appends the primary key to the end of every secondary index while MyISAM does not; the term for this type of combined index is an <a href="http://dev.mysql.com/doc/refman/5.6/en/index-extensions.html">extended secondary index</a>. To ensure each of our queries uses only one index or the other, I've kept the two indexes separate, with neither serving as the primary key. I made a row ID the primary key just to better illustrate that I purposefully avoided making either index primary -- you could remove the column and primary key entirely without affecting our testing.</li>
<li><b><i>DATE</i>s are accepted in multiple formats.</b> When MySQL can tell you are trying to use a representation of a datatype, it automatically converts the representation to whatever makes the most sense in the context. For example, the string value "20" can be converted to the integer value 20 when MySQL detects an integer operation or comparison is being attempted. This can be an advantageous optimization because integer operations are usually much faster to perform. The same automatic type casting applies to the <i>DATE</i> datatype. The string '2014-01-24' and the integer 20140124 are both valid <i>DATE</i> expressions (though the string version is preferred as it's valid SQL). See <a href="http://dev.mysql.com/doc/refman/5.0/en/using-date.html">B.5.5.2 Problems Using DATE Columns</a> for more information.</li>
<li><b>Indexes can be accessed by prefix, both by column and value.</b> If a table has a single index on <i>('column1', 'column2', 'column3')</i>, a query accessing <i>column1</i> alone or <i>column1</i> and <i>column2</i> together can take advantage of the index. However, a query accessing <i>column2</i> alone, <i>column3</i> alone, or <i>column2</i> and <i>column3</i> together can not. Similarly, a query accessing <i>column1</i> and <i>column3</i> together can take advantage of the index, but only for <i>column1</i>. Partial value checks are also subject to this prefix rule. If we want to find all rows with a string in <i>column1</i> starting with "hello", the index can be used. If we want to find all rows with <i>column1</i> containing "hello" in the middle or at the end of the string, the index can not be used. We could also use the index in checks on <i>column2</i> and/or <i>column3</i> for values beginning with a given string as long as the preceding columns on the index are also accessed by the query within the same constraints. This behavior is discussed further in <a href="http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html">8.5.3 How MySQL Uses Indexes</a>.</li>
<li><b>Computed values of a column can not take advantage of indexes.</b> While built-in functions and operators are typically much faster than user-defined functions and stored procedures, anything that has to do some type of computation upon a column can't use the index to quickly find rows matching a given criteria. This is because the index only contains the raw values of the column, not possible results of a function. This is one scenario when denormalization should be considered: another column may be indexed which contains the precomputed result of the function, and the query can be rewritten to check this column for the expected result. I've done this with our test table by breaking the date into three separate columns containing the results of the <i>YEAR(test_date)</i>, <i>MONTH(test_date)</i>, and <i>DAYOFMONTH(test_date)</i> functions. To illustrate the performance gains of doing so, I've included both strategies in the test queries.
</li>
</ul>
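The hand-maintained denormalization described in the last point can be automated in later MySQL releases (5.7.6 and up) with generated columns, which keep the exploded parts in sync with the date automatically. This is a hedged sketch of that alternative table definition, not something benchmarked here, and the column types are my own choice:

```sql
CREATE TABLE `test_dateindex_gen` (
  `test_id` int unsigned NOT NULL AUTO_INCREMENT,
  `test_date` date NOT NULL,
  -- STORED generated columns are materialized and indexable; MySQL
  -- recomputes them whenever test_date changes.
  `test_date_year`  smallint unsigned AS (YEAR(test_date)) STORED,
  `test_date_month` tinyint  unsigned AS (MONTH(test_date)) STORED,
  `test_date_day`   tinyint  unsigned AS (DAYOFMONTH(test_date)) STORED,
  PRIMARY KEY (`test_id`),
  KEY `test_date` (`test_date`),
  KEY `test_date_unfolded` (`test_date_year`,`test_date_month`,`test_date_day`)
) ENGINE=InnoDB;
```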
<p/>
Now for the testing. For each of the ranges (yearly, monthly, and weekly), I have designed several queries that find the same rows through different <i>WHERE</i> clauses and return only the constant <i>TRUE</i> in order to eliminate data retrieval overhead from the benchmarks. Each query is executed sequentially once with caching disabled and MySQL profiling enabled, then the sequence starts over. This occurs 5 times, and the minimum, maximum, and average execution time of each variant is recorded. I've created <a href="http://www.subnetroot.com/source/mysql-benchmark-092514.php">a PHP script</a> to automate this process. If the minimum and maximum times for any single query are too dispersed, the results for that sequence are thrown out and the script is run again until a good sample is collected. Additionally, the ranking of queries by average execution time must be reproducible by running the script again after a few minutes, though the average execution times themselves may differ. For one query to be considered to outperform another, their average execution times must differ by at least <sup>1</sup>/<sub>100</sub> secs.
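The per-query timings come from MySQL's session profiler. Run by hand rather than through the script, one iteration of a sequence looks roughly like this (the profiler was the standard tool for this at the time; later releases deprecate it in favor of the Performance Schema):

```sql
SET profiling = 1;   -- enable per-statement profiling for this session

SELECT SQL_NO_CACHE TRUE FROM test_dateindex
WHERE test_date BETWEEN '2014-01-01' AND '2014-12-31';

SHOW PROFILES;       -- lists each profiled statement with its duration
-- SHOW PROFILE FOR QUERY 1;  -- optional stage-by-stage breakdown
```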
<p/>
<b><u>Yearly Queries</u></b>
<p/>
<table style='border:2px groove red; font-size:8pt;'><tbody style='vertical-align:top;'>
<tr style='font-size:13px; font-weight:bold; font-style:italic; background-color:#e0dd00; text-align:center;'><td colspan='5'>SELECT SQL_NO_CACHE TRUE FROM test_dateindex...</td></tr>
<tr style='font-size:13px; font-weight:bold; background-color:#e0dd00;'><td>Query</td><td>Min</td><td>Max</td><td>Avg</td><td>Rank</td></tr>
<tr><td>...WHERE YEAR(test_date) = 2014</td><td>0.419515</td><td>0.525119</td><td>0.4713726</td><td>6 6 6</td></tr>
<tr style='background-color:#eeeeee;'><td>...WHERE test_date LIKE '2014-%'</td><td>0.442024</td><td>0.534959</td><td>0.5027824</td><td>7 7 7</td></tr>
<tr><td>...WHERE test_date BETWEEN '2014-01-01' AND '2014-12-31'</td><td>0.414662</td><td>0.493443</td><td>0.4522718</td><td>5 5 5</td></tr>
<tr style='background-color:#eeeeee;'><td>...WHERE test_date BETWEEN 20140101 AND 20141231</td><td>0.372847</td><td>0.428081</td><td>0.3940104</td><td>3 2 2</td></tr>
<tr><td>...WHERE test_date >= '2014-01-01' AND test_date <= '2014-12-31'</td><td>0.374320</td><td>0.483726</td><td>0.4403736</td><td>4 4 4</td></tr>
<tr style='background-color:#eeeeee;'><td>...WHERE test_date >= 20140101 AND test_date <= 20141231</td><td>0.354298</td><td>0.444041</td><td>0.4082758</td><td>1 3 3</td></tr>
<tr><td>...WHERE test_date_year = 2014</td><td>0.362729</td><td>0.424568</td><td>0.3899468</td><td>2 1 1</td></tr>
</tbody></table>
<p/>
These queries were chosen for the following qualities:
<p/>
<ul>
<li>Query 1 uses the built-in <i>YEAR()</i> and then compares the result to a constant. This is a common query because of its readability. This query is expected to be the slowest because it requires a computational comparison and therefore cannot use the date column's index.</li>
<li>Query 2 performs a string comparison on the prefix of the date (the year). String comparisons are slower than integral ones, but this query was chosen to see if the optimizer can check column prefixes of a date with a string. Remember: <i>DATE</i>s can be accepted in multiple formats.</li>
<li>Queries 3-6 use the two common range comparison methods with both <i>DATE</i> (string) constants and integers. This is to clarify which type of range comparison performs the best, even though the SQL standard suggests the string constant form. The integral versions are expected to outperform the string constant ones because integral comparisons are so fast.</li>
<li>Query 7 is a control variable in our experiment. It uses the exploded index containing precomputed values for the date column's various parts. This query is expected to be the fastest because it requires a simple, fast integral comparison against the index.</li>
</ul>
<p/>
<b>What do the results tell us?</b> Surprisingly, query 1 is not the slowest; it is slightly faster than query 2. This means that, like query 1, query 2 cannot use the index: the optimizer is unable to check column prefixes of a date with a string. Moving on to the range queries in 3-6, there isn't a huge difference between <i>BETWEEN</i> and the inequality operators. Range comparisons involving integers do seem faster than ones using <i>DATE</i> (string) constants, however. This is odd, since the SQL standard and MySQL both state a preference for the string constant form. In fact, both forms of range comparison involving integers appear to be nearly as fast as our control query 7!
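You don't have to take the timings alone on faith: <i>EXPLAIN</i> shows whether each <i>WHERE</i> form can use the <i>test_date</i> index at all. A quick sketch of that check (the row estimates in the output will vary from server to server):

```sql
-- Full scan expected: wrapping the column in a function hides it
-- from the index, so no key can be used.
EXPLAIN SELECT SQL_NO_CACHE TRUE FROM test_dateindex
WHERE YEAR(test_date) = 2014;

-- Range scan expected: the raw column is compared directly, so the
-- optimizer can walk the test_date index (look for type: range,
-- key: test_date in the output).
EXPLAIN SELECT SQL_NO_CACHE TRUE FROM test_dateindex
WHERE test_date BETWEEN '2014-01-01' AND '2014-12-31';
```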
<p/>
<b><u>Monthly Queries</u></b>
<p/>
<table style='border:2px groove red; font-size:8pt;'><tbody style='vertical-align:top;'>
<tr style='font-size:13px; font-weight:bold; font-style:italic; background-color:#e0dd00; text-align:center;'><td colspan='5'>SELECT SQL_NO_CACHE TRUE FROM test_dateindex...</td></tr>
<tr style='font-size:13px; font-weight:bold; background-color:#e0dd00;'><td>Query</td><td>Min</td><td>Max</td><td>Avg</td><td>Rank</td></tr>
<tr><td>...WHERE YEAR(test_date) = 2014 AND MONTH(test_date) = 1</td><td>0.249090</td><td>0.286419</td><td>0.2602324</td><td>5 5 5</td></tr>
<tr style='background-color:#eeeeee;'><td>...WHERE test_date LIKE '2014-01-%'</td><td>0.227903</td><td>0.249212</td><td>0.2352724</td><td>4 4 4</td></tr>
<tr><td>...WHERE test_date BETWEEN '2014-01-01' AND '2014-01-31'</td><td>0.044413</td><td>0.049277</td><td>0.0475242</td><td>3 3 3</td></tr>
<tr style='background-color:#eeeeee;'><td>...WHERE test_date >= '2014-01-01' AND test_date <= '2014-01-31'</td><td>0.041343</td><td>0.044828</td><td>0.042945</td><td>2 2 2</td></tr>
<tr><td>...WHERE test_date_year = 2014 AND test_date_month = 1</td><td>0.027241</td><td>0.031058</td><td>0.0291766</td><td>1 1 1</td></tr>
</tbody></table>
<p/>
These queries were chosen for the following qualities:
<p/>
<ul>
<li>Query 1 is similar to the yearly query 1. It uses the built-in <i>YEAR()</i> and <i>MONTH()</i>, and then compares the result to a constant. This is a common query because of its readability. This query is expected to be the slowest because it requires a computational comparison and therefore cannot use the date column's index.</li>
<li>Query 2 is similar to the yearly query 2. It performs a string comparison on the prefix of the date (the year and month). String comparisons are slower than integral ones, but this query was chosen to see if the optimizer can check column prefixes of a date with a string. Remember: <i>DATE</i>s can be accepted in multiple formats.</li>
<li>Queries 3-4 are similar to yearly queries 3-6. They use the two common range comparison methods with <i>DATE</i> (string) constants. As with the yearly queries, these are expected to be outperformed by integral versions, though I do not include them here for brevity.</li>
<li>Query 5 is a control variable in our experiment. It uses the exploded index containing precomputed values for the date column's various parts. This query is expected to be the fastest because it requires a simple, fast integral comparison against the index.</li>
</ul>
<p/>
<b>What do the results tell us?</b> As expected, query 1 is the slowest, trailing the second slowest by <sup>3</sup>/<sub>100</sub> secs and the fastest by nearly a quarter of a second. The lesson learned from yearly query 2 applies to monthly query 2 as well: string comparisons against dates can't use the index or check prefixes. The difference, however, is that the monthly form outperforms monthly query 1, whereas the yearly form was slower than yearly query 1. My belief is that this happens because yearly query 1 uses only one computational comparison. In other words, string comparisons against dates are faster than <u>two</u> computational comparisons, but slower than <u>one</u>. Looking at queries 3-4, we see another parallel to the yearly queries: <i>BETWEEN</i> and the inequality operators perform about the same (a difference of less than <sup>1</sup>/<sub>100</sub> secs). We could also infer that either range comparison method would be outperformed by an integer-using counterpart, and that those counterparts would both perform nearly as well as our control query.
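One practical note on queries 3-4: hard-coding '2014-01-31' as the upper bound works for January but has to change with each month's length. A half-open range built from the first of the month sidesteps the month-length arithmetic entirely while still comparing the raw column, so it keeps the index usable. This variant wasn't benchmarked above; it's a sketch of the pattern:

```sql
-- All of January 2014, without needing to know it has 31 days.
-- The interval is computed once per query, not per row.
SELECT SQL_NO_CACHE TRUE FROM test_dateindex
WHERE test_date >= '2014-01-01'
  AND test_date <  '2014-01-01' + INTERVAL 1 MONTH;
```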
<p/>
<b><u>Weekly Queries</u></b>
<p/>
<table style='border:2px groove red; font-size:8pt;'><tbody style='vertical-align:top;'>
<tr style='font-size:13px; font-weight:bold; font-style:italic; background-color:#e0dd00; text-align:center;'><td colspan='5'>SELECT SQL_NO_CACHE TRUE FROM test_dateindex...</td></tr>
<tr style='font-size:13px; font-weight:bold; background-color:#e0dd00;'><td>Query</td><td>Min</td><td>Max</td><td>Avg</td><td>Rank</td></tr>
<tr><td>...WHERE YEARWEEK(test_date) = YEARWEEK('2014-01-25')</td><td>0.227227</td><td>0.279895</td><td>0.2487122</td><td>5 5 5</td></tr>
<tr style='background-color:#eeeeee;'><td>...WHERE test_date BETWEEN ('2014-01-25' - INTERVAL (DAYOFWEEK('2014-01-25')-1) DAY) AND '2014-01-25'</td><td>0.013000</td><td>0.018527</td><td>0.0152778</td><td>4 4 4</td></tr>
<tr><td>...WHERE test_date BETWEEN '2014-01-19' AND '2014-01-25'</td><td>0.010643</td><td>0.012964</td><td>0.0118214</td><td>2 1 1</td></tr>
<tr style='background-color:#eeeeee;'><td>...WHERE test_date_year = 2014 AND test_date_month = 1 AND test_date_day >= 19 AND test_date_day <= 25</td><td>0.009818</td><td>0.014883</td><td>0.012639</td><td>1 2 2</td></tr>
<tr><td>...WHERE test_date_year = 2014 AND test_date_month = 1 AND test_date_day BETWEEN 19 AND 25</td><td>0.011837</td><td>0.015604</td><td>0.0138756</td><td>3 3 3</td></tr>
</tbody></table>
<p/>
These queries were chosen for the following qualities:
<p/>
<ul>
<li>Query 1 uses the built-in <i>YEARWEEK()</i> to compute the week of the date column and the current date. This differs from yearly query 1 in that two computations are performed instead of one for the sake of readability (I consider <i>YEARWEEK('2014-01-25')</i> to be more readable than simply 201403), though the second usage <a href="http://dev.mysql.com/doc/refman/5.0/en/where-optimizations.html">only needs to be calculated once</a> for the lifetime of the query, not for each row. This is a common query because of its readability. This query is expected to be the slowest because it requires a computational comparison and therefore cannot use the date column's index.</li>
<li>Queries 2-3 are similar to yearly queries 3-6. They use the <i>BETWEEN</i> range comparison method with <i>DATE</i> (string) constants. As with the yearly queries, these are expected to be outperformed by integral versions, though I do not include them here for brevity. These queries are expected to perform nearly identically because query 2's computations only need to be done once for the lifetime of the query, similar to weekly query 1.</li>
<li>Queries 4-5 are the controls in our experiment and are also similar to yearly queries 3-6. They use the exploded index containing precomputed values for the date column's various parts, each also using one of the two range comparison methods. These queries are expected to be the fastest because they each require a simple, fast integral range comparison against the index.</li>
</ul>
<p/>
<b>What do the results tell us?</b> The results for the weekly queries don't tell us much, because all queries except the first are equal for our purposes (a difference of less than <sup>1</sup>/<sub>100</sub> secs). In nearly every test run, queries 3-5 swap rankings, though query 2 is consistently the second slowest by a small margin. I believe query 2 is slower than the others because of its one-time computational overhead. I had expected queries 4-5 to outperform query 3 because of their use of the exploded index, but that does not appear to be the case. No real advantage is demonstrated among queries 3-5.
<p/>...<p/>
In summary, we now have a few rules of thumb that we can, and probably should, apply to production-level queries. First, queries relying on per-row computed values are always the slowest. Unless readability is critical to a production query, functions and operators acting upon a column should be replaced. Also, for all the debate among DBAs and programmers over <i>BETWEEN</i> versus less-than/greater-than operators, the difference is nearly non-existent; it's a matter of personal preference. The real performance boost lies in making the operand an integer rather than a string constant, despite various documentation recommending the string constants. Finally, denormalizing a <i>DATE</i> column offers only negligible performance advantages over simply comparing against the whole date, though the advantage does exist in most cases.
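<p/>
To make the first rule of thumb concrete, here is a sketch of replacing a per-row function with an equivalent range over the raw column. The <i>YEAR()</i> form is assumed as the readable original; the table is the <i>test_dateindex</i> table from the experiments above:

```sql
-- Slow: YEAR() must be evaluated for every row, so the index on
-- test_date cannot be used.
SELECT SQL_NO_CACHE TRUE FROM test_dateindex
WHERE YEAR(test_date) = 2014;

-- Faster: the same filter expressed as a range predicate the optimizer
-- can satisfy with a range scan over test_date's index.
SELECT SQL_NO_CACHE TRUE FROM test_dateindex
WHERE test_date BETWEEN '2014-01-01' AND '2014-12-31';
```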
<p/>
Happy optimizing!<div class="blogger-post-footer"><script><!--
google_ad_client = "pub-2199571311866273";
google_ad_width = 468;
google_ad_height = 60;
google_ad_format = "468x60_as";
google_ad_type = "text_image";
//2007-04-21: Devlog
google_ad_channel = "6299116136";
google_color_border = "6699CC";
google_color_bg = "80FF00";
google_color_link = "999999";
google_color_text = "AECCEB";
google_color_url = "063E3F";
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-30062730966982025432012-08-28T17:10:00.001-05:002014-10-01T16:07:30.158-05:00New Website Launch and Job HuntingIn case you haven't noticed, the BlakeNet site has been down for quite a while. In its place comes a much more serious website whose scope is beyond BlakeNet's original experimental sandbox and download repository purposes. My new site, <a href="http://www.subnetroot.com">Subnet ROOT (SNR)</a>, serves to bring professional contacts, as well as personal and casual ones, all the information they might want to know about me. All content previously provided by BlakeNet will still be available through SNR. If there are any articles here that still link to the defunct BlakeNet, please let me know and I will be happy to provide the working alternative.
<p>
In the meantime, did I mention I'm looking for employment opportunities? I just finished up my Bachelor of Science in Engineering and Information Sciences (Game and Simulation Programming concentration). Now I'm looking to get my foot in the door in the video game industry or other computer programming careers. Find me on <a href="http://www.linkedin.com/in/bmballard">LinkedIn</a> (professional inquiries only, please) or <a href="http://www.subnetroot.com/index.php?act=contact">contact me through SNR</a> (professional or personal inquiries).
<p>
<b>Edit:</b> I'm no longer actively seeking opportunities, though I am always open to professional networking. Feel free to continue reaching out to me!<div class="blogger-post-footer"></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com5tag:blogger.com,1999:blog-14892197.post-82037367521130668652012-08-02T21:50:00.002-05:002013-01-29T07:04:00.029-06:00VSFTPD "refusing to run with writable root inside chroot" Work-aroundHey there, everyone! I've been very busy lately with a couple different projects, but I thought I'd take the time to share my latest challenge and how I worked around it. The other day, I updated my Ubuntu servers from 10.04 to 12.04. This update brings with it a new version of VSFTPD (Very Secure FTP Daemon) which boasts some security improvements. Unfortunately, one of these improvements is causing a lot of headaches for me and other server maintainers.
<p>
VSFTPD has buffed up security pertaining to chroot'ed users. The (brief) reasoning behind it is that users jailed to directories they have write access on may alter scripts in such a way that would allow them to "break out of" the jail. While I acknowledge this as something to take into consideration, I feel the compromise between security and usability currently being forced onto us is unacceptable. Many configurations that worked perfectly fine before the update now throw an error when a chroot'ed user logs in:
<p>
<blockquote>500 OOPS: vsftpd: refusing to run with writable root inside chroot ()</blockquote>
<p>
A quick Google search turns up some obvious solutions, like stripping write permissions from the user on their home directory, moving files you want the user to have access to into subfolders (where they can have write access), or using the <b>local_root</b> config directive to jail users to the parent directory instead. You can even replace VSFTPD with the vsftpd-ext package, which supports the <b>allow_writeable_chroot</b> config directive. But these weren't viable solutions for me; I didn't want to remove write access from users on the <i>one</i> place on the system I feel they should be guaranteed full run over, I didn't want to allow nosy users to see what other files and folders were outside of their own area, and I didn't want to install an alternative package just to fix one simple issue. So I came up with my own solution!
<p>
My configuration is set up to not jail users by default. This allows accounts like mine and my operating partners to navigate the entire system, while also allowing me to limit what is visible to specific users for simplicity or to keep them out of places they shouldn't be in.
<p>
<blockquote><b>$ nano /etc/vsftpd.conf</b><br>
...<br>
chroot_local_user=NO<br>
chroot_list_enable=YES<br>
chroot_list_file=/etc/vsftpd.chroot_list<br>
user_config_dir=/etc/vsftpd_user_conf</blockquote>
<p>
<blockquote><b>$ nano /etc/vsftpd.chroot_list</b><br>
# List of local users to chroot() jail in their home directory when FTPing<br>
foo<br>
bar<br>
foobar</blockquote>
<p>
Above, we can see that my config file calls for local users to not be chroot jailed by default, but to jail specific users listed in <i>/etc/vsftpd.chroot_list</i>, and to also find user-specific config files in the <i>/etc/vsftpd_user_conf</i> directory. The list file is pretty straight-forward (one username per line). This next part is where the magic comes in. The objective is to jail the users in such a directory that the only thing they could possibly see in their FTP listing would be their home directory. This directory must not have write access, and it must preserve the existing directory structure. There is a great utility for this sort of thing: links! But therein lies a problem. Directories can only be symlinked/softlinked, and FTP clients typically can't follow symlinks.
<p>
Luckily, there is another, much more powerful way to create links to directories. If you are familiar with the <b>mount</b> command, you know that it's normally used to attach a filesystem on physical media to a directory. Did you know it can also create links between one area of the filesystem and another? The magical command uses the syntax <b>mount --bind original_directory mount_point</b>. Now let's put it to work!
<p>
This next part will require you to know the superuser password. If you don't know it, but your account is allowed to <b>sudo</b>, skip the <b>su root</b> command and prefix each of the other lines with <b>sudo</b>.
<p>
<blockquote><b>$ nano /etc/vsftpd_user_conf/foo</b><br>
local_root=/srv/ftp/foo</blockquote>
<p>
<blockquote>$ su root<br>
# mkdir /srv/ftp/foo<br>
# chown foo:foo /srv/ftp/foo<br>
# chmod ug-w /srv/ftp/foo<br>
<br>
# mkdir /srv/ftp/foo/foohome<br>
# mount --bind /home/foo /srv/ftp/foo/foohome</blockquote>
<p>
In the first part, we see that we change foo's local root directory to <i>/srv/ftp/foo</i> instead of his typical home directory (in this example, <i>/home/foo</i>, though it could be anywhere). I chose to use the <i>/srv/ftp</i> directory to house these jails, but you can use any directory you'd like as long as it has world read-execute permissions. In the second part, we run a couple commands to create the necessary directories, set the owner and access flags, and finally link the original home directory to the new area.
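<p>
Because bind mounts do not persist across reboots, the per-user setup lends itself to a small script. Here is a sketch (the usernames and paths are the examples from this post; adjust them to your own users) that could be called from a startup hook such as <i>/etc/rc.local</i>:

```shell
#!/bin/sh
# Restore the per-user FTP jail bind mounts (e.g. at boot).
# Usernames and paths follow the examples above; adjust as needed.
for user in foo bar foobar; do
    jail="/srv/ftp/$user/${user}home"
    # Skip users whose jail directory hasn't been set up yet.
    [ -d "$jail" ] || continue
    # Avoid stacking a second mount if the script is re-run by hand.
    if ! mount | grep -q " $jail "; then
        mount --bind "/home/$user" "$jail"
    fi
done
```

An alternative worth considering is a bind entry in <i>/etc/fstab</i> (e.g. <b>/home/foo /srv/ftp/foo/foohome none bind 0 0</b>), which the system mounts automatically at boot.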
<p>
That's all there is to it! Simply repeat the above process for each jailed user, giving each one their own <i>vsftpd_user_conf</i> entry and <i>/srv/ftp</i> folder. Please note that you will have to execute the <b>mount --bind</b> command again for each user after every system reboot. To avoid this chore, I created a script and referenced it in my <i>/etc/rc.local</i> file. You can find more details about writing scripts and making them execute each time your system starts on plenty of other sites. I hope this has helped!<div class="blogger-post-footer"></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com5tag:blogger.com,1999:blog-14892197.post-84922284895983233512012-04-21T18:12:00.002-05:002012-04-21T18:14:38.313-05:00SDS Status Report: April 2012Hello! I have a couple things here to talk about. First off, as I often do, I must apologize for my lack of attention to the SDS project lately. I was involved with my senior project and <a href="http://blakebgsp475.blogspot.com/">another project involving Second Life society</a>, the two of them taking up pretty much all of my time. I've since graduated and restarted work on some projects I had abandoned during my educational pursuits, SDS included. So, I have some bad news, good news, and more bad news involving the fate of the project.
<p>
<b>The bad news first.</b> To put it bluntly, I discovered that what I was promising the final version of SDS to do is just not possible right now. Looking back on my proposals, I had adopted two mentalities about all of this. They were "this project is absolutely possible. It's just a matter of making things run fast enough to be convenient to the user" and "this shares concepts with GPS technology, and GPS is fairly fast, therefore this project can be made to be as fast as GPS." I found out empirically that it just can't be made to run fast enough (given the tools immediately available to me), and it's very different from GPS. More on the GPS bit later (in a future article). In conclusion, this project will never reach the state I assumed it would, and so I must insist that SixDegreeSteam be retired.
<p>
<b>There's good news too.</b> I'm determined to leave this thing in as much of a working state as can be managed. My intention is to abandon the database approach, create an easy-to-follow graphical output, and have it accomplish the original objective I set out to meet, which was to outline connectivity of one user to the rest of the community. I've already gotten a lot of work in on these points in the past couple days. There's a lot more to go, but at this rate, it'll be done before anyone knows it.
<p>
<b>Now the final piece of bad news.</b> Parts of the Steam Community API got updated recently, throwing a generously sized monkey wrench into what used to work in SDS. This slows progress down because I now have to go back and rethink what I already had down. In conjunction, I contacted the Valve employee I understand to be the lead developer on the Steam Community to ask for any information pertinent to the project, or some added public API functionality, or anything to help me out. I was met with a shockingly uninterested response I interpreted as "we've made tools available already. We have no more time to spend helping you datamine," which was understandable albeit surprising given Valve's vested interest in other community members and modders. Things have been stacked against me and this project's success from the beginning. I was really hoping to catch a second wind from Valve.
<p>
To finish off this status report, I'd like to reiterate just how sorry I am to have taken so long with all of this, only to let everyone waiting patiently on it down. I've appreciated the interest, comments, feedback, and time each person has given to SixDegreeSteam. It's with a heavy heart and a shameful disposition that I must break this news to you. SixDegreeSteam will soon be concluded. After this next client release, it'll all be put away until I can acquire better means to meet all my promises. If you want to keep an ear in it so you can be aware of such a day, I suggest subscribing to this blog (for news on SDS and my other projects) or staying a member of the <a href="http://steamcommunity.com/groups/sixdegreeproject">SDS Steam group</a>. Again, sorry.<div class="blogger-post-footer"></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-62695571748915728492012-01-05T14:17:00.001-06:002012-01-05T14:18:16.458-06:0056 Days of Second LifeComing back from a long hiatus (wow, has it really been a year?), I went back to school earlier this week. My new classes have officially started, and boy are they monsters! The good news is this is the last stretch for me before graduation. The bad news is that these classes are going to eat up every ounce of my spare time.<br />
<br />
One of the two classes I'm enrolled in now is an Emerging Technologies course. Along with a couple other assignments, we have to work on a project that demonstrates, well, emerging technologies! I've decided to do an 8-week-long project focusing on the social interactions that take place in <a href="http://www.secondlife.com">Second Life</a>. It was between that or a robotics project, and I don't feel confident enough with physical electronics to have my academic success ride upon it. At any rate, this project should be really interesting! If you'd like to follow it, I'm required to maintain a blog detailing my findings and analysis of what I observe. You can find it <a href="http://BlakeBGSP475.blogspot.com">over here</a>.<br />
<br />
On a related note, expect an update on SDS pretty soon. It'll probably upset a lot of people, but it's high time I speak up about it. See you in a day or two!<div class="blogger-post-footer"></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-37245759073881821282011-01-22T14:12:00.003-06:002012-04-14T12:06:42.379-05:00SixDegreeSteam 1.6.6 "Pure Client" Beta LiveI'm very proud to announce the release of SixDegreeSteam version 1.6.6 ("Pure Client"), available at <a href="http://sixdegsteam.subnetroot.com/index.php?act=beta166">the SDS site</a>. This is a closed beta that anyone can download and use, but it will only work with Steam profiles that are members of the <a href="http://steamcommunity.com/groups/sixdegreeproject">SDS Steam group</a>. This version is <b>NOT</b> meant to be representative of SDS, which is to say this is just a technical demo. It doesn't include all planned features, is expected to run pretty slowly (it's dependent on your processor, memory, and internet speed), and will most likely not find relationships longer than 5 profiles (at least not with the default user limit; even if you raise the limit, it'll still take a really long time). So what DOES it do? Here's the list:<br />
<br />
<ul><li><font color=green>DOES</font> find your relationship to closely related users (5 or less recommended)<br />
<li><font color=green>DOES</font> print found relationships in a list structure<br />
<li><font color=green>DOES</font> show the potential of the project, and why the dependence on the server is so important (seriously, do you really want to have to use a program this slow?)<br />
<li><font color=red>DOES NOT</font> generate links using groups or private profiles<br />
<li><font color=red>DOES NOT</font> have a graphical user interface (unless you count the command prompt as graphical)<br />
<li><font color=red>DOES NOT</font> print the relationships in a very user-friendly manner (yet...I'm working on that)<br />
<li><font color=red>DOES NOT</font> find relationships unless at least one of the given profiles is a member of the SDS Steam group</ul><br />
This is really only meant to show the current standing of the technology and give a sense of why the project is taking so long to complete. Trying to get something that is inherently very slow to work fast enough to be usable is not an easy task. Just remember that this is only the start of a potentially great project. Please be patient.<br />
<br />
In the meantime, I hope everyone who tries the beta understands that it isn't, nor is it meant to be, very good or user-friendly. There is still a lot to be done. If you have comments, please let me know. I love hearing from you all!<br />
<br />
Thanks so very much for your gracious and humbling interest<br />
- Blake (ROOT)<div class="blogger-post-footer"></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-53592063137399355872010-07-03T17:34:00.002-05:002010-07-05T01:09:57.282-05:00Getting Past OnLive's LAN RequirementI was lucky enough to be part of the <a href="http://www.onlive.com">OnLive</a> Founding Members promotion since I pre-ordered way back when the service first came around. If you were too, you might know about the current restriction in place on WiFi connections. Basically, it's a temporary means to limit the ways players can connect to the service in order to rule out connection reliability as a factor during these early stages when troubleshooting is inevitable. As far as I know, this is only meant to help cut down on the amount of calls-for-help from the technologically inept. Imagine every noob using the service via a wireless connection flooding support forums, inboxes, and phones with problems pertaining to their network connection speed, which is totally not OnLive's problem/fault.<br />
<p><img src="http://i51.photobucket.com/albums/f352/blakeo_x/OnLive-NoWifi.png"><br />
<p>However, I don't think it's all that fair to limit those of us with prominent networking experience and wireless connections. So, I found a way past it. <font color=red><b>WARNING: </b>Using this method of connecting to the OnLive service will most likely make you ineligible for support from OnLive until either wireless connections are supported or you actually connect through LAN. YOU HAVE BEEN WARNED!</font><br />
<p>The fix is very simple and can be done on Windows Vista or 7 (perhaps other versions as well, but I haven't been able to verify). Follow these steps:<br />
<p><ol><li>Right click your network connection icon in the notification area, then choose <i>Open Network and Sharing Center</i>.<br />
<ol><li><b>OR </b>open your Control Panel, click the <i>Network and Internet category</i>, then <i>Network and Sharing Center</i>.</ol><li>Click <i>Change adapter settings</i> in the left pane.<br />
<li>Select the wireless network connection you use to connect to the Internet and an unused local area connection (click one, hold CTRL, then click the other).<br />
<li>Right click one of the selected items and choose <i>Bridge Connections</i>. A Network Bridge item should be created.</ol><p>Enjoy! The only real side effect I noticed about this procedure is that your network connection icon may display the disabled/unconnected icon even if you are fully connected to the network and Internet. If you don't mind that, this work-around is pretty solid. When OnLive adds WiFi support or if you decide you want your LAN connection back, just go back into the adapter settings and delete the network bridge item.
<p><b>PS: </b>Add me as a friend or spectate me sometime! My gamertag is <i>Blakeo_x</i>.<div class="blogger-post-footer"></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-92001829648527249782010-03-01T17:18:00.000-06:002010-03-01T17:18:07.422-06:00SDS Status Report: February 10The past month has been very dull for the SixDegreeSteam project, unfortunately. A lot of personal baggage has bullied progress to the bottom of the stack. However, with a new month comes a renewed working spirit. February wasn't completely without change for SDS, though. The general crawling was halted early in the month. The decision to do so was based on the fact that an accurate backbone has already been developed (the purpose of the general crawling) and that crawling on a relational chain basis, which is much more time- and resource-friendly, will likely replace a huge portion of the general crawling dataset upon client release. The SDS crawler has also reached a new version. SDS Local Server 1.6.1 reduces the resident memory of the SteamUser class, features a slightly modified database querying scheme to [fractionally] speed up present and future data acquisition, and utilizes a new community page wherein friend data is stored in XML format, speeding up crawling and reducing bandwidth consumption.<br />
<br />
March is likely to be a very productive month for SixDegreeSteam. With any luck, the client will be ready for public use soon!<div class="blogger-post-footer"></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com2tag:blogger.com,1999:blog-14892197.post-13863639545949410082010-01-31T20:04:00.000-06:002010-01-31T20:04:10.544-06:00SDS Status Report: January 10A full month of crawling has passed, leaving the database pretty full and everyone involved with the project ecstatic to see everything coming together. A lot of progress has been made since the last report, so there is a lot to report on here. Strap in!<br />
<br />
On January 3, this discussion took place:<br />
<br />
<blockquote><i>January 3, 2010 chat between Blake "ROOT" and Harry "BlackSyte":</i><br />
<br />
<b>Blake:</b> SDS is running a lot slower now that things are filling up, and the partitioning [of the database in an attempt to make things faster] didn't go over too well. My next attempt will be to do manual partitioning, but I don't want to do it right now because I just worked pretty hard to repair the damage the automatic partitioning did...It slowed things down to 1 process per minute.<br />
<br />
With the current table loads and everything back to normal, it's about 50 profile processes per minute. It's not THAT bad, but it isn't fast enough. Something exciting, though: SDS has reached the edge nodes. The very first Steam user has been crawled, and the very last Steam user (as of this writing) has been crawled. All that is left is everyone in between! But it stands to prove that my theory might hold weight after all. It shows that someone, thereby everyone, in the SDS Steam group is connected to both the minimum and maximum edges.</blockquote><br />
On January 18, SixDegreeSteam Local Server 1.5.87C (the current version) was implemented. It is thus far the most stable release with many improvements over the initial release from December. Family 1.5 brought in better error handling, a logging mechanism, parameter-based execution, a "prioritize child nodes" option for high-priority queue entries, and some programmatic fine-tuning. Release 1.5.87 introduced multi-threading for running multiple local servers (crawlers) consecutively and fixed some profile tracking issues that were previously irreproducible in production.<br />
<br />
On January 21, the client algorithm was successfully executed. The interface is still a couple weeks out, but is coming quickly. Work on the interface was stalled after the 21st due to coursework and may continue to be stalled for 4 more weeks.<br />
<br />
As of this writing, the database reports these statistics:<br />
<br />
- Users crawled: 2,518,356<br />
- Groups crawled: 748,140<br />
- Profiles pending: 5,988,765<br />
- Total discovered users: 7,791,431<br />
- Total discovered groups: 748,141<br />
- Average time between discovery and crawling: 9 days<br />
<br />
<b>Bonus:</b> This is a discussion that took place on January 5. It doesn't do much to prove progress of the project, but does offer some interesting food for thought:<br />
<br />
<blockquote><i>January 5, 2010 chat between Blake "ROOT", Dallas, and Harry "BlackSyte":</i><br />
<br />
<b>Blake:</b> It's amazing how quickly the graph edges progressed.<br />
<b>Dallas:</b> But the surface has only been barely scratched.<br />
<b>Blake:</b> Exactly! I mean, we aren't even halfway through all the users, yet a streamlined backbone has emerged. That raises a concern...Perhaps there are more profiles with a betweenness centrality less than or equal to their degree centrality than I thought. Since a rigid, well-defined backbone has already formed so early in the project, yet there are relatively few new user discoveries, it makes me think[...]<br />
<br />
There are two possible conditions under which this would happen, guessing that the crawler itself is not at fault, and my confidence of that is fairly high as of 1.5C. First [possibility], there are a shitload of people who form "cliques" [or] small collections of friends who do not join groups or befriend "outsiders"...and by shitload, I mean over 80% of the entire demographic. That's hard to believe, but not improbable.<br />
<br />
The second case, which is even more unlikely but is very possible given a generational standpoint, is that there are sections of the entire demographic who are only friends with other members of the same section. So, you have 4 million users in segment A who are friends with other users in that same section, but none of them are friends with users from section B. It's incredibly unlikely, but it's a valid portrayal of an existing graph theory called generational demography -- Newer users tend to be friends with other newer users, while older users tend to be friends with other older users, and never shall the two meet.<br />
<br />
Here's another interesting observation I've made. The queue timeframe is currently 9 days... Now, what that means is that there is a 9-day waiting period between discovery and crawling. It's a common occurrence in the crawler that a profile will become invalidated within those 9 days; a wildly common occurrence, in fact. People are deleting their profiles or changing their profile names way too often, somewhere between 1 and 9 days!<br />
<b>Harry:</b> Does that hamper the crawling process?<br />
<b>Blake:</b> In the first release, yes. The crawler would actually crash with a fatal error because it was expecting the profile to be there, but it wasn't. That would happen in the first couple of days of launch before I patched it, which is pretty funny because that means the profiles were becoming invalidated within 1-3 days!</blockquote><div class="blogger-post-footer"><script><!--
google_ad_client = "pub-2199571311866273";
google_ad_width = 468;
google_ad_height = 60;
google_ad_format = "468x60_as";
google_ad_type = "text_image";
//2007-04-21: Devlog
google_ad_channel = "6299116136";
google_color_border = "6699CC";
google_color_bg = "80FF00";
google_color_link = "999999";
google_color_text = "AECCEB";
google_color_url = "063E3F";
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-69262596009208683592010-01-14T16:32:00.001-06:002010-01-14T21:47:59.989-06:00SixDegreeSteam: Synopsis<blockquote>I was curious as to a means that people like myself could come to an understanding on roughly how this program works... I understand what the goal of it is, but how do you intend to do this?...<br />
<br />
- TopRaman</blockquote><br />
While I have explained a lot of SixDegreeSteam's abstract mechanics in previous articles, I feel they were not programmatically detailed enough to give a good picture of how exactly the program operates. So, I'll try to fill that gap here. There are three components to the project; I'll try to discuss all of them as simply and cleanly as possible.<br />
<br />
<b>WARNING:</b> Previous articles focused on earlier versions of the project. Likewise, this article will focus on the current version of the project as of this writing (1.5C). I feel this current version is the most efficient, so it is likely to stick around for a long time. But this is a disclaimer just in case the project does change again substantially.<br />
<br />
The first and (thus far) most time-consuming component is the crawler. Without it, the project would not have a dataset. The crawler has the straightforward job of collecting links to other profiles from one profile, saving (queueing) those links, then opening them back up sequentially to collect more links. The process continues until all links are collected and thereby all profiles are analyzed. This process of collecting, storing, reading, and repeating is called crawling. The specific crawling logic will not be described here for the sake of brevity, but all the details can be found in an earlier post titled <a href="http://blakecode.blogspot.com/2009/09/sixdegreesteam-challenges.html">SixDegreeSteam: Challenges</a>. Program-wise, though, the crawler downloads the pages containing the links, extracts the links using a combination of Regular Expressions and XML parsing, and inserts them into a SQL database table called the crawler queue. The crawler also extracts some basic profile information, such as SteamID, profile name, and avatar, in a similar manner and inserts it all into another table called the user dataset (or group dataset). The complete list of fields stored per user is SteamID, last crawl time (to prevent recrawling a user too frequently), profile name, avatar, friends, and group memberships. The complete list of fields stored per group is similar: SteamID, last crawl time, group name, avatar, and members.<br />
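As a rough illustration of that collect-extract-queue cycle, here is a minimal Python sketch. The regular expression, the sample HTML, and the one-column queue schema are illustrative assumptions; this post does not publish the crawler's actual patterns or SQL layout.

```python
import re
import sqlite3

# Assumed pattern; the real crawler's regexes and schema are not published.
PROFILE_LINK = re.compile(r'steamcommunity\.com/profiles/(\d+)')

def extract_friend_ids(html):
    """Pull the unique profile IDs linked from a downloaded friends page."""
    return sorted(set(PROFILE_LINK.findall(html)))

def queue_ids(db, ids):
    """Insert newly discovered IDs into the crawler queue, skipping any
    that are already queued (PRIMARY KEY plus INSERT OR IGNORE)."""
    db.execute("CREATE TABLE IF NOT EXISTS crawler_queue "
               "(steamid TEXT PRIMARY KEY)")
    db.executemany("INSERT OR IGNORE INTO crawler_queue VALUES (?)",
                   [(i,) for i in ids])
    db.commit()
```

Each crawl step would download a friends page, run the extraction over it, and queue the results; the queue is then read back sequentially to drive the next round of downloads.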
<br />
Given that the crawler does its job correctly, we are left with a database filled (and I mean <b>FILLED</b>) with information about the users and groups of the Steam Community. Oddly enough, the information is pretty inconsequential without a way to harness the datasets. So, our second component, aptly named Pathfinder, is arguably as important as the crawler. Pathfinder has the sole purpose of using the information available through the database to calculate a lowest-cost path between two given nodes. To do this, I've opted for an object-oriented rendition of the <a href="http://en.wikipedia.org/wiki/Breadth-first_search">breadth-first search</a> graph theory algorithm. Once again, for the sake of brevity, I won't go into detail about the algorithm. If you are curious about it, follow the link or run a Google search. There are tons of articles that have covered it far better than I can. After the algorithm is applied, the list of users and groups forming the shortest route from profile A to profile B is displayed, with links to each profile and avatars for recognition purposes.<br />
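A bare-bones sketch of the breadth-first search at Pathfinder's core, assuming the dataset has been loaded into an in-memory adjacency map (the real Pathfinder reads friend and group links from the SQL database instead):

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Return the shortest chain of nodes linking start to goal,
    or None if no connection exists. graph maps node -> neighbors."""
    parents = {start: None}          # doubles as the visited set
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []                # walk parent links back to start
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                frontier.append(neighbor)
    return None
```

Because BFS explores the graph in rings of increasing distance, the first time the goal is dequeued the recovered chain is guaranteed to be a shortest one.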
<br />
The third and newest addition to the component set is a sort of network browser. Using a graphical, navigable web of the dataset, users will be able to quickly traverse the entire Steam Community social network, allowing them to get a better idea of just where they fit into it all. This component is simply an auxiliary to the project and is not a real focus. As such, it will be the last to be implemented and will only enter the development picture after the crawler and Pathfinder are thoroughly completed.<br />
<br />
As was said about Pathfinder, the information gathered by the crawler is pretty useless to the end user until a method of putting it to work is created. This explains why the crawler has been operational for nearly a month now, yet the site is still empty.<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-30614829980510110822009-12-27T21:20:00.000-06:002009-12-27T21:20:26.966-06:00SDS Finished (For Real This Time!)The centerpiece of the project, the SixDegreeSteam crawler/server, has been thoroughly completed. Since the promised launch date of December 25, the crawler had been experiencing memory issues that would cause it to die roughly 5 minutes after startup. I'm very pleased to announce that version 1.5.77C of the crawler (yes, it seriously took that many revisions to get everything up to working condition) is now in effect, and running smoothly!<br />
<br />
Now I can switch over to working on the client software. More about that will be released later. Here are some statistics about the limited runtime the crawler got between its crashes.<br />
<br />
- Uptime: 2 hours<br />
- Users crawled: 1,961<br />
- Groups crawled: 10,337<br />
- Users/Groups pending: 408,376<br />
<br />
A special thanks to everyone who has sent me emails, joined the Steam group, and otherwise completely overwhelmed me with their gracious comments! It's truly inspiring to see so many people in support of what I envisioned to be a small personal project. As always, feel free to email me with questions, comments, concerns, or funny lolcats. blake.oxx@gmail.com (remove the dot)<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-22044955976669868742009-12-16T19:46:00.000-06:002009-12-16T19:46:58.399-06:00SixDegreeSteam FinishedAfter about 3 months of development, SixDegreeSteam has reached the final steps in the launch process. The database has been set up, the crawler has been completed, and the folks at Valve have been sent "please let me borrow your bandwidth" cookies. Everything is pumped and primed. However, the project has not yet been released on the community. Why? Because Valve has not yet agreed. As soon as the green flag is given, the experiment will begin full-force.<br />
<br />
As a side note, a previous article I posted about how SDS would acquire information was wrong. Actually working on the program and digging through the Steam Community site has enlightened me a little more about the crawling process. The article will be corrected as soon as I have time.<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-45162716455883646792009-11-29T13:26:00.004-06:002011-01-19T21:07:16.965-06:00SixDegreeSteam Mailbag 2<blockquote>@SixDegreeSteam: Bump bump bump! Updates, please!<br />
<br />
- Anonymous</blockquote><br />
Good timing! It just so happens there is yet another development in the SDS programming progress. I decided to set aside all of the existing code for further development and implementation at a later time, and start over yet again. In this new iteration, I intend to keep things simple with a focus on mechanics and practicality, not resources and scalability. This aids the project in two ways. First, interest in hosting slave servers has been very low. So, it seems like a waste to work so hard on a component that may not even be used (or necessary, for that matter). Second and most obviously, deployment has stalled, and stalled, and stalled some more. I am very dissatisfied with the current timeline for completion compared to the originally proposed timeline. Going back to basics will speed up development phenomenally and consequently lead to a much closer launch date.<br />
<br />
<blockquote>Where will the project be accessible from?<br />
<br />
- Anonymous</blockquote><br />
<a href="http://sixdegsteam.subnetroot.com">The brand-new subdomain</a> on my <a href="http://www.subnetroot.com">brand-new server/site</a> will be the point of deployment. While we are on the subject, I will also be moving my blogs to the new site once the general site is set up.<br />
<br />
This article is a reply to an email. If you would like to ask a question or suggest a new article, email me. blake.oxx@gmail.com (remove the dot)<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-34352145421196017232009-10-24T17:30:00.000-05:002009-10-24T17:30:05.369-05:00SixDegreeSteam: ProgrammaticFirst of all, thanks to everyone who has subscribed to my Devlog, sent me emails, or otherwise expressed great interest in SixDegreeSteam. You all serve as my motivation, and I am very proud to be so highly regarded. Thank you all! Now, as promised, here is another SDS tech article.<br />
<br />
From the conception of the project, SixDegreeSteam has changed fundamentally a couple times. The reason is that there are so many factors to consider that it is near impossible for me alone to take them all into account. I've sat down and worked out the best approaches I can conceive, but after putting them all into working code, something new will show itself and justify a rethinking. The major issue here is that there are two conflicting, equally and critically important aspects to this project: bandwidth and memory usage.<br />
<br />
Bandwidth and memory are quasi-mutually exclusive. In other words, you cannot optimize one in a program that utilizes both technologies without negatively impacting the other. The reason for this is simple. In SixDegreeSteam, with the exception of static data and a small amount of dynamic data, anything that we transfer (bandwidth usage) is useless to the crawler and is therefore unnecessary to keep in memory (memory usage). Likewise, anything kept on the client side for a longer time, indefinitely, or otherwise in a non-time-critical manner is not likely to be needed by the server, hence memory on the client is used in favor of bandwidth. While this trade-off does have its implications, bandwidth is a much more precious commodity in the long run. Memory can be reclaimed by throwing away data, but bandwidth comes in limited amounts (for most of us) on a monthly basis. However, the more memory available to the program, the faster and larger the resulting dataset can be. I've had to rewrite the messaging protocol between the server and client once, the crawler module thrice, and the program base twice as a result of trying to find the best trade-off between the two technologies. I have a sick feeling there will be a few more rewrites to come, and many more after the project goes public.<br />
<br />
The bandwidth-versus-memory conundrum has led to a change in the structure of the program as a whole. In previous articles and discussions, I spoke of a multi-threaded program layout consisting of a server and a client (slave threader), with the client managing "slave threads" responsible for the actual crawling operations. As of this writing, multi-threading is still a part of the server aspect in order to allow the server to communicate with multiple clients at once. But, the client has been changed from the slave threader approach to a single-thread, single-crawler, tree-based data collection scheme. The client starts by requesting a Steam profile to begin crawling from the server. Then, an object at the top of a tree data structure is created to represent the profile. The object then crawls the ID, friends, and groups of the profile, creating leaf objects for each in the data structure as they are found. After the data structure reaches a set capacity, it's compressed and sent to the server. The client then removes the obsolete data and continues crawling profiles not yet completed. Two questions arise from this approach.<br />
<br />
First, how do we make sure we aren't wasting memory and a small amount of bandwidth by crawling profiles the server already has on record? The simple answer: by wasting an even smaller amount of bandwidth between the client and server to ask if the server needs the profile. When a profile is queued for crawling on the dataset, the client will send a message to the server containing the profile ID. The server will then respond with a message stating whether to continue crawling or remove the object from the queue. Herein lies the problem outlined before by our memory-bandwidth trade-off. On one hand, we could save bandwidth between the client and server by just crawling all objects put on the stack at the expense of the memory the objects would consume. On the other, we could save that memory at the expense of bandwidth to ask the server if all the profiles we queue have already been crawled. The problem is made even more complex by the thought that any profiles we remove from the dataset after sending to the server or after discovering they have already been crawled will not be known to the client in the future. The possibility of the profile resurfacing is almost definite (which is exactly what the project sets out to discover!). So, question two arises.<br />
<br />
Second, how do we stop the client from crawling profiles more than once? As the answer to question one pointed out, we have the opportunity to just ask the server if we have crawled it before or not. But, this will inevitably turn out to be a gigantic waste of bandwidth because we will have to ask the server the same question more than once to receive the same answer. The alternative is to implement a low-impact cache, capable of quickly and compactly storing/recovering which profiles the client has been told are already crawled. And, you guessed it, the trade-off is presented. The cache increases memory usage to relieve bandwidth usage.<br />
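The cache idea can be sketched as follows; <i>ask_server</i> here is a stand-in for the real client/server message exchange, whose format is not specified in this post, and the class itself is purely illustrative:

```python
class CrawledCache:
    """Remembers which profiles the server has already reported as
    crawled, so the same question never costs bandwidth twice."""

    def __init__(self, ask_server):
        self.ask_server = ask_server   # callable: profile ID -> bool
        self.known_crawled = set()     # compact local record of answers

    def should_crawl(self, steamid):
        if steamid in self.known_crawled:
            return False               # answered before; no message sent
        if self.ask_server(steamid):   # one round trip, then cached
            self.known_crawled.add(steamid)
            return False
        return True

    def mark_crawled(self, steamid):
        """Record a profile we crawled ourselves, so it is never redone."""
        self.known_crawled.add(steamid)
```

Only "already crawled" answers are cached; a "not yet crawled" answer is not, since the profile's status changes the moment we crawl it, which is what <i>mark_crawled</i> records.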
<br />
This is a small glimpse into the current layout of the program. It most likely does not do justice to the project complexity, but should work to give you an idea of how the program will work and why it's taking me so long to complete it.<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-335549769889521082009-10-04T17:23:00.005-05:002009-10-24T17:32:32.442-05:00SixDegreeSteam Mailbag<blockquote>Any updates on SixDegreeSteam? You said you would keep us posted, but it's been a while since I heard anything.<br />
<br />
- Anonymous</blockquote><br />
<br />
Work and college stuff has been keeping me pretty preoccupied, so I haven't had time to do a whole lot with SDS lately. I'm currently working on a low-bandwidth method of transferring data between the clients (slave threaders) and server. I also hope to post another article about the technical goodies behind the project, like what exactly a slave threader is. In the meantime, I've been considering multiple web hosting opportunities to run the SDS server on. I think I will also use the space to start work on the full-fledged BlakeNet (under a different name, as I said before). I still have no word from Valve about using Steamworks or the acceptability of the bandwidth costs on their end. I'll have to shoot some more emails later.<br />
<br />
<br />
<blockquote>Could I convince you to make SDS open-source?<br />
<br />
- Anonymous</blockquote><br />
<br />
I have never been one to withhold source code for personal projects from anyone interested in reading through it. However, I will have to keep it under wraps until development has ceased and I feel confident with its efficiency. Shortly after the project goes mainstream, the source code will be readily available for all to enjoy.<br />
<br />
This article is a reply to an email. If you would like to ask a question or suggest a new article, email me. blake.oxx@gmail.com (remove the dot)<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-73497227314135014682009-09-12T22:09:00.002-05:002009-09-12T22:15:41.502-05:00SixDegreeSteam Official Group OpenThe <a href="http://steamcommunity.com/groups/sixdegreeproject">official SixDegreeSteam group (SixDegree$team)</a> is now open for all interested Steam users to join. While project news and developmental goodies will be pushed here first, the SDS Steam group will be updated with important project updates only. Joining is a good way to get involved and show your support if you use Steam. Being a member of it will also grant you bragging rights when/if the project becomes mainstream, because members of the group will be the first to be crawled when we launch!<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-79522575504352731032009-09-10T15:03:00.009-05:002010-01-14T18:45:11.707-06:00SixDegreeSteam: ChallengesThere are a few caveats to consider when developing this project. I have laid the information out in a (hopefully) easy-to-follow situation-caveat format. An important thing to remember is that not all caveats have an inherent solution, and the best solution is not always the most favorable.<br />
<br />
<b>Situation:</b> First and foremost is the way Steam Community presents the information we are hoping to use. Since all information will be acquired by following "tips" from one page to a handful of others, and continuing on this pattern until all pages are analyzed, we call the process crawling. It works much like a crawler for a search engine: where a search engine crawler follows links on one page to others, our crawler will read IDs from one page and create new links pointing to sequential pages. Crawling friend information is easy, as the entire list of friends for a profile is in an HTML file. All we have to do is download the file from the site, extract all friend IDs, and queue them for crawling. Similarly, there is an XML file for each user containing profile information, including the groups the user is a member of. As far as crawling groups goes, a set of XML files for each group exists that lists all member IDs (one XML file per 1000 members). Note that the XML file we use to get group memberships of a user is not ideal, since it contains a lot more information than we need. This means that the speed of download and bandwidth usage (on both the crawling server and the Steam Community network) are impacted unnecessarily. A solution to this may be to use the Steamworks API available from Valve to developers, which features the ability to directly access community information without the need to crawl community pages, thus reducing bandwidth usage and overhead. However, Valve has declined granting me usage rights to Steamworks.<br />
<b>Caveat:</b> Steam users have the ability to set privacy options that affect who is allowed to view their profiles. The current privacy options are private, public, and friends-only. Our crawler will only be able to use public profiles. Something to note, though, is that groups are always public, therefore all members are visible. This means that we can at least gather group membership information for non-public profiles. Likewise, if the private user is a friend of a public user, we will still be able to link the two users in our dataset since friendship on Steam is mutually inclusive (user A cannot have user B on their friend list without user B also having user A on theirs).<br />
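Since the member list comes as one XML file per 1,000 members, a group with <i>N</i> members costs ceil(<i>N</i>/1,000) downloads. A small sketch of that pagination follows; the URL template is hypothetical, for illustration only, not the real endpoint:

```python
def member_list_urls(group_id, member_count, page_size=1000):
    """Yield one URL per member-list page: ceil(member_count / page_size)
    files are needed to cover the whole roster."""
    pages = -(-member_count // page_size)   # ceiling division
    for page in range(1, pages + 1):
        # Hypothetical URL template, for illustration only.
        yield ("http://steamcommunity.com/gid/%s/memberslistxml?p=%d"
               % (group_id, page))
```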
<br />
<b>Situation:</b> As was previously stated, the project is very web-intensive, utilizing a crawler technique to gather our data. This is a massive undertaking, requiring a huge number of page/file downloads and subsequent parsing. While it is impossible to give an exact number to represent the size of our resulting dataset, the number of files we will need to download can be represented by a simple formula. We need at least one page per user (the profile information XML file, which also contains group membership information). This page is used to tell if the profile is public by checking to see if we get an access error, data is missing, or the privacy setting is explicitly shown. If it's private, it would be wasteful to download the next file, as we don't have permission to view it and will just get a similar error to the first file. If it's publicly viewable, we download a second file (friend list HTML) to complete our data collection for that user. Groups are a lot simpler as they are always public and will always have at least one member, therefore at least one file (the member list XML). Our formula to represent how many files we will need to download is <i>A + B + (C * ceil(D/1,000))</i>, where <i>A</i> is the number of Steam users, <i>B</i> is the number of users with public profiles, <i>C</i> is the number of groups, and <i>D</i> is the average number of members per group. As of this writing, there are over 53,300,212 Steam users and 1,101,806 groups. Guessing that at least three quarters of all users have public profiles and every group has 2,000 members (in reality, some users are only in one group where others are in one hundred, and some groups have one member where others have one million. 2,000 members per group is the best on-the-spot average estimation I can come up with for now), our file count is <i>53,300,212 + (53,300,212*(3/4)) + (1,101,806 * ceil(2,000/1,000)) = 53,300,212 + 39,975,159 + 2,203,612 = 95,478,983</i>.<br />
<b>Caveat:</b> To put that into data consumption terms, a member list for a group that has 1,000 members is about 42 KB, a friends list containing 50 friends is about 40 KB, and a complete profile page (used to get group memberships) displaying 10 group memberships is about 14 KB. It's very rare that a single file will be greater than 42 KB. Also, the user profile XML is what we use to tell if a profile is private or not, so it's important to point out that the file returned for a private profile is about 366 B. So, modifying our previous formula to calculate bandwidth usage, we get <i>(53,300,212 * 0.36) + (39,975,159 * (40 + (14-0.36))) + (2,203,612 * 42) = 2,256,007,309 KB = 2.10 TB</i>. That is a serious amount of data to be transferring! We won't keep all of it, but as pointed out in Caveat 1, this is how we have to collect the data for lack of an alternative.<br />
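Both estimates can be checked mechanically. This snippet plugs the stated inputs into the file-count and bandwidth formulas above; the per-file sizes are the measurements quoted in this post, not fresh data:

```python
import math

users = 53_300_212                     # Steam accounts at time of writing
public = users * 3 // 4                # assume 3/4 of profiles are public
groups = 1_101_806
avg_members = 2_000                    # rough average members per group

# Files: one profile XML per user, one friends page per public profile,
# and ceil(D/1,000) member-list files per group.
files = users + public + groups * math.ceil(avg_members / 1_000)

# Bandwidth in KB: every user costs the 0.36 KB profile stub; a public
# profile adds a 40 KB friends page plus the remaining 13.64 KB of the
# full 14 KB profile XML; each member-list file is about 42 KB.
kb = users * 0.36 + public * (40 + (14 - 0.36)) \
     + groups * math.ceil(avg_members / 1_000) * 42
tb = kb / 1024 ** 3                    # KB -> TB
```

Running it reproduces the numbers in the text: 95,478,983 files and roughly 2.26 billion KB, i.e. about 2.10 TB.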
<br />
<b>Situation:</b> Caveat 2 points out that there is a lot of data transfer involved. So, bandwidth usage is definitely an issue. Naturally, processing time for the data is also a factor. We can't provide accurate estimates for how long the data parsing for each of the three files would take because some computers are faster than others, processing time changes as memory becomes/is no longer available, etcetera. In either case, the obvious solution to speeding up both crawling and processing, as well as spreading bandwidth usage, is to use more than one computer. To speed processing up even further, we can create more than one instance of the program and have each instance work on a different page concurrently. This means we can process ten or more pages in the time it would take a single instance to process one. I will discuss both of these in-depth in a later article.<br />
<b>Caveat:</b> Creating a control structure that can manage multiple instances (called threads) is one thing. Creating a server program that can remotely manage multiple programs running multiple instances is a completely different thing. The architecture of the server and its complementing "slave threaders" will have to be such that each slave threader is capable of working, to an extent, on its own. The only time the slave threader should call home to the server is when the global scope of the dataset needs to be analyzed or all threads are idle (more work is needed). We adopt this minimalistic communication policy to make sure the bandwidth usage of the data between the two components does not become a factor in itself. This is a very intricate topic that requires extra thought on its own. As such, a future article will be devoted to this topic alone.<br />
<br />
These are only a handful of issues we need to consider when tackling this project. They represent the major challenges I am facing right now in the design of the program. In an upcoming article, Caveat 3 will be examined more closely to show how I intend on structuring the server-slave threader relationship and my implementation of thread pooling.<br />
<br />
<b>Edited:</b> This post was updated for correctness on January 14, 2010.<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-42483750207983929972009-09-09T16:46:00.005-05:002009-09-12T21:13:51.074-05:00SixDegreeSteam: The IntroAs my previous article hinted at, I am planning to write a program that expresses the link between every <a href="http://steampowered.com">Steam</a> user. As of this article's writing, there are exactly 53,300,212 accounts registered with Steam. This includes active, inactive, banned, duplicate, publisher/staff, and Internet Cafe ("multiseat") accounts. It is impossible to distinguish just how many of those accounts are actually worth the effort of this project without walking through each account profile (ironically, that's just what this project sets out to do). Each of those accounts is allowed to have a maximum of 255 friends, and an unknown (perhaps unlimited?) number of group memberships. Now that we have some numbers down, let's talk about the actual program.<br /><br />Introducing SixDegreeSteam, an attempt at mapping the Steam social network. The name is a reference to the <a href="http://en.wikipedia.org/wiki/Six_degrees_of_separation">Six Degrees of Separation</a> sociological theory that the project is based on, which states that every person in the world is linked to every other person by no more than six people. My theory is that a similar phenomenon exists within all social mediums, specifically the Steam platform's underlying network Steam Community. However, there is a fundamental difference between Steam and other social networking mediums; some people only use Steam to game, not make friends! This means that some accounts will have 100 friends and be members of 10 different groups, while others will have 0 friends and 0 groups. 
The network is even more obfuscated because some people may have friends and group memberships, but only with people they know well. This makes the web of connections incredibly shallow and in severe cases exclusive, with no connections to the mainstream web or other exclusive webs. In other words, "Everybody knows everybody" only works if "everybody" is not limited to a select few.<br /><br />There are also a couple technical caveats we must address. They will be discussed in an upcoming article.<div class="blogger-post-footer"><script><!--
google_ad_client = "pub-2199571311866273";
google_ad_width = 468;
google_ad_height = 60;
google_ad_format = "468x60_as";
google_ad_type = "text_image";
//2007-04-21: Devlog
google_ad_channel = "6299116136";
google_color_border = "6699CC";
google_color_bg = "80FF00";
google_color_link = "999999";
google_color_text = "AECCEB";
google_color_url = "063E3F";
//-->
</script>
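The link-finding at the heart of the project boils down to a breadth-first search over the friend graph: start from one user and fan out through friends-of-friends until the target is reached. Here is a minimal sketch in Java; the in-memory `friends` map and the user names are hypothetical stand-ins for real Steam Community data, which the project would have to crawl.

```java
import java.util.*;

// Minimal sketch: breadth-first search for the degrees of separation
// between two users in a friends graph. The graph is a hypothetical
// adjacency map, not real Steam data.
class SixDegrees
{
    // Returns the number of hops between `from` and `to`, or -1 if
    // they live in disconnected ("exclusive") webs.
    static int degreesOfSeparation(Map<String, List<String>> friends,
                                   String from, String to)
    {
        if (from.equals(to)) return 0;
        Set<String> visited = new HashSet<>();
        Queue<String> queue = new LinkedList<>();
        Map<String, Integer> depth = new HashMap<>();
        visited.add(from);
        depth.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty())
        {
            String current = queue.remove();
            for (String friend : friends.getOrDefault(current, Collections.emptyList()))
            {
                if (friend.equals(to)) return depth.get(current) + 1;
                if (visited.add(friend))
                {
                    depth.put(friend, depth.get(current) + 1);
                    queue.add(friend);
                }
            }
        }
        return -1; // no path between the two users
    }

    public static void main(String[] args)
    {
        // Hypothetical users: Alice -> Bob -> Carol is two degrees.
        Map<String, List<String>> friends = new HashMap<>();
        friends.put("Alice", Arrays.asList("Bob"));
        friends.put("Bob", Arrays.asList("Alice", "Carol"));
        friends.put("Carol", Arrays.asList("Bob"));
        System.out.println(degreesOfSeparation(friends, "Alice", "Carol"));
    }
}
```

Breadth-first search is the natural fit here because it finds the *shortest* chain of friends first, which is exactly what "degrees of separation" asks for.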
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-40435116506161672702009-09-07T18:54:00.004-05:002009-09-10T15:02:46.391-05:00Six Degrees of Inactivity<blockquote>Where did you go??? It's been months since your last post! Are you working on anything interesting?<br /><br />- Anonymous</blockquote><br /><br />First off, sorry for the inactivity, guys. I've been insanely busy lately with a lot of stuff, mostly rectifying a couple of situations in my personal life. Now, between college work and my new job, I don't have much time to myself. What little time I do have, I use to play games with friends or work on little side projects. Because of this, I wouldn't expect to hear too much from me for a while. Feel free to email me to say hi or catch up if you'd like! I check my email daily when I get home from work.<br /><br />Having said that, I do have a major project I'm in the middle of conceptualizing. Circumstances permitting, I'll make an in-depth post about it later. For now, the concept alone will have to suffice. Have you ever heard of the <a href="http://en.wikipedia.org/wiki/Small_World_Phenomenon">Small World Phenomenon</a>, or more to the point, the <a href="http://en.wikipedia.org/wiki/Six_degrees_of_separation">Six Degrees of Separation</a> model? They are very interesting ideas that attempt to link every person in the world. The idea has been built upon to create some popular games like <a href="http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon">Six Degrees of Kevin Bacon</a>, <a href="http://www.netsoc.tcd.ie/~mu/wiki/">Six Degrees of Wikipedia</a>, and even a Facebook application. Surprisingly, there is no implementation of this idea for the wildly popular Steam platform. Well, I think it's about time someone puts an end to that, and who is more qualified to do so than me? 
:)<br /><br />The goal of the project is to create a mapping showing how every Steam user is linked to everyone else. The project will take into account friends, as well as group memberships. This is a very exciting project for me and a major undertaking. More information to come when/if I get the time!<br /><br />This article is a reply to an email. If you would like to ask a question or suggest a new article, email me. blake.oxx@gmail.com (remove the dot)<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-51619158279123331712009-05-30T11:18:00.004-05:002009-06-13T05:37:40.146-05:00Multiplayer Ports<blockquote>Your multiplayer examples use port 8087. Is that required, or can I change that? Is there any importance to it?<br /><br />- Anonymous</blockquote><br /><br />If you recall my multiplayer design series, the example code I published used port 8087. There is no real significance to this port. I just thought it had a nice ring to it. :)<br /><br />In general, you can change the port to whatever you want. However, you should avoid ports that other services use: if the port is already in use, your program will fail to bind it. The well-known ports, 0 through 1023, are reserved for system services and should never be used. To find a port that is unregistered or only used by an exotic program, I use <a href="http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers">this Wikipedia article</a>. As you can see, port 8087 is actually registered to some other programs, but they are not very common.<br /><br />This article is a reply to an email. If you would like to ask a question or suggest a new article, email me. blake.oxx@gmail.com (remove the dot)<div class="blogger-post-footer"><script><!--
//-->
</script>
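Beyond consulting a list, the most reliable way to find out whether a port is free is to simply try binding it. A minimal Java sketch, assuming you want to fall back to the next port when your first choice is taken (the fallback logic and numbers here are my own illustration, not part of the original multiplayer series):

```java
import java.io.IOException;
import java.net.ServerSocket;

// Minimal sketch: probe candidate ports by attempting to bind them,
// skipping the well-known range (0-1023) entirely.
class PortPicker
{
    // Returns the first bindable port at or above `start` (clamped to
    // 1024), trying at most `attempts` ports; -1 if none are free.
    static int findOpenPort(int start, int attempts)
    {
        for (int port = Math.max(start, 1024); attempts > 0; port++, attempts--)
        {
            try (ServerSocket socket = new ServerSocket(port))
            {
                // Bind succeeded; the socket is released again on close
                return socket.getLocalPort();
            }
            catch (IOException e)
            {
                // Port is in use (or otherwise unavailable); try the next one
            }
        }
        return -1;
    }

    public static void main(String[] args)
    {
        System.out.println("First open port: " + findOpenPort(8087, 10));
    }
}
```

Probing like this also catches ports that are busy but absent from any registry list, which a lookup alone would miss.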
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-22250575771127224892009-04-11T10:44:00.003-05:002009-04-11T11:22:41.601-05:00Carriage ReturnsRecently, I stumbled upon (<a href="http://www.stumbleupon.com">literally</a>) an article written by Chad W. L. Whitacre discussing a use for the infamously useless carriage return. Any programmer who has worked with direct input to/from file buffers has probably encountered a character that accompanies the newline character on some platforms. The usual sequence Windows uses to terminate a line and move to the beginning of the next one is <i>\r\n</i>, where <i>\r</i> is a carriage return (which moves the cursor back to the start of the current line) and <i>\n</i> is a line feed (which advances to the next line).<br /><br />If you stop to think about it, it makes sense: the pairing mirrors how a typewriter works. First you return the carriage to the start of the line (the carriage return), then you advance the paper to the next line (the line feed). The interesting part is that <i>\r</i> on its own does not start a new line at all; it just repositions the cursor, so whatever you print next overwrites what is already there. Chad found that this can be used to rewrite the current line of a terminal! In <a href="http://blag.whit537.org/2008/11/use-for-carriage-return.html">his article</a>, he talks about using this technique in a TTY shell to make an updating progress bar. I thought I would take it a step further and see whether the same trick works in a Java command-line program. Sure enough, it does! 
The following is a basic program with a progress bar that updates as the program runs:<br /><br /><code>class progressbar<br />{<br /> public static void main(String[] args) throws InterruptedException<br /> {<br /> System.out.println("Running a really long loop...");<br /> <br /> // How many times the loop should run<br /> int runcount = 500;<br /> <br /> // Prepare the progress meter<br /> System.out.print("[0/" + runcount + "]");<br /> <br /> for (int i = 0; i < runcount; i++)<br /> {<br /> // Stand-in for work that takes a while to execute<br /> Thread.sleep(10);<br /> <br /> // Return to the start of the line and overwrite the meter<br /> System.out.print("\r[" + (i+1) + "/" + runcount + "]");<br /> }<br /> <br /> // Pad with spaces so the leftover meter characters get overwritten too<br /> System.out.println("\rdone!     ");<br /> }<br />}</code><br /><br />Now no one has an excuse not to have progress displays in their programs! :)<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-6866618326321133602009-03-09T07:08:00.003-05:002009-03-09T07:15:53.234-05:00TF2 OopsiesLately, I've been doing <a href="http://www.sourcemod.com">SourceMod</a> plugin development for everything from infinite ammo/uber to new gameplay modes for use in <a href="http://blakecode.blogspot.com/2009/02/brazen-guard.html">BZG</a> and other TF2 servers. This morning I was testing my newest gameplay plugin called Hot Potato. It was the first really vigorous test I had run with the plugin, and it failed dramatically in flames of game crashes. I managed to capture a (laggy) piece of the carnage and post it on YouTube. <a href="http://www.youtube.com/watch?v=gYRkBFFIhsQ">Check it out</a>. As the video description states, the plugin crashed the game for at least four people, spawned 200+ entities within 10 seconds, and made 2244 function calls within 60 seconds.<br /><br />I guess this serves as proof that pros are still prone to screw up. :p<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com2tag:blogger.com,1999:blog-14892197.post-7970998394741456932009-02-16T06:20:00.002-06:002009-02-16T06:36:40.724-06:00The Brazen GuardThe Brazen Guard (BZG) is this awesome online community I have been a member of since May 25, 2008. Currently, we operate two Team Fortress 2 servers and one Gmod server. We play strictly noncompetitively, although we do have a small group dedicated to clan wars and scrimmages.<br /><br />Around January 24, 2009, I was given RCON administrative access and a place in the leadership of the clan. Since then, BZG has undergone some great changes thanks to a revitalization of the community. Something we do regularly is hold what we call Event Nights. Every Friday and Saturday, we lock one of our TF2 servers and only allow BZG members in. Then, we play the game using our own little minigame ideas. It's absolutely, insanely fun. Everyone has a great time and the brotherly bond we all share becomes stronger.<br /><br />The whole purpose of this post is to inform everyone about BZG's own YouTube channel, <a href="http://www.youtube.com/user/bzgauthority">The BZG Authority</a>, where I will be posting video highlights of our scrimmages, Event Nights, and other madness. You can find more information about the servers and the clan at <a href="http://www.clantoolz.com/brazenguard">http://www.clantoolz.com/brazenguard</a>. We are by no means an exclusive community, but if you would like to join, you have to play with us on our servers a couple of times so we can be sure you're serious about becoming part of the family.<br /><br />Anyone who has TF2 or Gmod, come on and play with us! Otherwise, enjoy the videos and be sure to rate/comment!<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com3tag:blogger.com,1999:blog-14892197.post-19349850897472807012009-01-07T20:10:00.003-06:002009-01-07T20:13:52.421-06:00We're Back!BlakeNet is back online. This year brings the chance for a lot of new stuff on BlakeNet. The biggest news is that I plan to start seeking sponsorship and analyzing development costs to upgrade the server, get some video equipment, and turn BlakeNet into the comedy video site it was envisioned to be. Don't expect to see these changes happen overnight, but if all goes well, we might have an awesome new project on our hands! :)<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0tag:blogger.com,1999:blog-14892197.post-24200657608943779742008-12-23T00:27:00.002-06:002008-12-23T00:30:27.335-06:00BlakeNet DownBlakeNet will be down for a while. Our new computer just came in, which meant I had to move equipment around and repurpose the wireless device that connected my server so our second computer could communicate wirelessly. Once I get back from Christmas vacation, I will start looking for a new wireless card so that I can get my equipment back online.<br /><br />Merry Christmas. :]<div class="blogger-post-footer"><script><!--
//-->
</script>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script></div>Blakehttp://www.blogger.com/profile/05991871359367810242noreply@blogger.com0