Visualizing Steam: Thought Process and Reflections

In the previous post, I briefly introduced the project on visualizing Steam data.

The following is a reflection on the process of creating the visualizations and dashboards:

What factors help determine whether a video game is popular on an online distribution platform?  Is it the price, the ratings, or the genre?  Or is it something else entirely?  In this visualization project, we examine the various facets of one of the biggest and best-known digital distributors of video games, Steam.

As one of the leaders of the video game industry, Steam is home to thousands of games, ranging from small “indie” (independent, in the same sense as indie music and indie films) games built by a handful of developers, to major titles from “AAA” (or “Triple-A”, meaning blockbuster-level budgets and production) companies such as Electronic Arts or Ubisoft.

Because there are thousands of games available on the service, picking out a game can be a daunting task for a first-time user.  Users might wonder: what makes a game so popular?  Is it the gameplay?  The graphics?  Or are players drawn in by something not directly related to the gameplay, such as the price point or the reviews?

In order to visualize data about the platform, we needed a dataset covering Steam’s library of games: genre, reviews, year of publication, price point, and so on.  Surprisingly, Steam does not publish an official dataset for external use, so there were two options: 1) perform web scraping and data mining ourselves, or 2) find a dataset from a user who had already done that work.  For the sake of brevity and to stay within the scope of the project, we opted for the latter.

The dataset chosen was pulled from Kaggle, an online repository of public datasets, courtesy of the user nikdavis.  The dataset was arranged into three spreadsheets, each covering different components, from information about the game itself (developer/publisher, size, price) to user-based information such as time played and reviews.  The data had been cleaned by the uploader, eliminating much of the cleansing portion of the data preparation phase, though some work remained.  While nikdavis’ dataset is the most comprehensive one available from third-party users, it has one flaw: it is current only up to 2019.  Having data for 2020 would have added more depth to the analysis and visualization, because 2020 was marked by the Covid-19 pandemic.  With much of the public forced to stay indoors to avoid infection, we most likely would have seen even richer data as greater numbers of people turned to video games as a source of entertainment.

The datasets were combined by performing a join on their shared primary key, the Steam application ID.  Each dataset had the same number of observations, well over 27,000 rows.  The join type did not matter much because the IDs formed a one-to-one relationship, but as a best practice an inner join was performed.
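For anyone reproducing the join outside of a BI tool, a minimal pandas sketch might look like the following; the file and column names here are assumptions about the Kaggle export, not the exact ones used in the project.

```python
import pandas as pd

# Load the three spreadsheets from the Kaggle dataset
# (file and column names here are illustrative, not exact).
games   = pd.read_csv("steam_games.csv")      # core info: name, price, genres, reviews
details = pd.read_csv("steam_details.csv")    # descriptions, requirements, etc.
users   = pd.read_csv("steam_user_data.csv")  # playtime, owners, and other user-based fields

# Join on the shared Steam application ID.  With a one-to-one key the join type
# barely matters, but an inner join keeps only IDs present in every table.
combined = (games
            .merge(details, on="appid", how="inner")
            .merge(users, on="appid", how="inner"))
```

A quick check that the combined row count still matches the originals (well over 27,000 rows) confirms the key really is one-to-one.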

While nikdavis performed much of the cleaning, some columns in the combined dataset still required work.  This included separating genres and tags into their own individual columns, as they were packed into single semicolon-delimited fields.
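A rough sketch of that step, assuming the column is named genres and holds semicolon-separated values:

```python
# Split the semicolon-delimited genre string (e.g. "Action;Adventure;Indie")
# into separate columns; games with fewer genres get None in the extra columns.
genre_split = combined["genres"].str.split(";", expand=True)
combined["genre_1"] = genre_split[0]   # primary genre, used later for the treemap
combined["genre_2"] = genre_split[1]
combined["genre_3"] = genre_split[2]
```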

The second data preparation step was to separate the range of users who owned a specific game.  Having the exact number of owners for each game would have been ideal, but users buy and return games constantly for any number of reasons, so pinpointing an exact figure is difficult; a range was used instead.  In addition to splitting the column and converting the pieces into numbers, we calculated an average for potential use.  We ended up with three numbers: the lower limit, the average, and the upper limit.
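A sketch of that transformation, assuming the ownership range is stored as a string such as "20000-50000" in a column named owners:

```python
# Split the ownership range into numeric bounds and take the midpoint as an average.
bounds = combined["owners"].str.split("-", expand=True).astype(int)
combined["owners_lower"] = bounds[0]
combined["owners_upper"] = bounds[1]
combined["owners_avg"]   = (bounds[0] + bounds[1]) / 2
```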

The final piece of preparation was to create other relevant formulas.  The most important of these was the adjusted Steam rating.  The standard Steam rating on the platform is simply the number of positive reviews divided by the total number of reviews.  A blog post from SteamDB explained that there were very few checks in place, stating:

“If you’ve ever taken a slightly longer look at the Steam store, you’ve probably noticed their method of sorting games by review score is pretty bad. They just divide the positive reviews by the total reviews to get the rating. A game with a single positive review would be ranked above some other game that has 48 positive reviews and a single negative review. While they do have “steps” at 50 and 500 total reviews, meaning that no game with a rating of at least 80% will be ranked below a game with less than 50 reviews, and no game with a rating of at least 95% will be below a game with less than 500 reviews, it’s still a bad system. Because if our 48 to 1 rated game suddenly accrued 11 more negative ratings, Steam would miraculously still place it higher than it was before.”

What this means is that, in the example, a game with 48 positive ratings and 11 negative ratings (roughly an 81% overall rating) would be ranked higher than a game with 48 positive ratings and 1 negative rating (roughly 98%), simply because the first game cleared the 50-review threshold.

As a result, SteamDB created their own algorithm that scores games while taking the number of reviews into account: for every factor-of-ten increase in the number of reviews, the uncertainty assigned to the score is halved.  So if a game with no reviews has 100% uncertainty, a game with 10 reviews has roughly 50% uncertainty, a game with 100 reviews roughly 25%, and so on.  The uncertainty never quite reaches 0%, but it gets very close.  The calculation, as implemented in Tableau, breaks down as follows:


\( \text{Total Reviews} = \text{Positive Reviews} + \text{Negative Reviews} \)

\( \text{Review Score} = \dfrac{\text{Positive Reviews}}{\text{Total Reviews}} \)

\( \text{Rating} = \text{Review Score} - (\text{Review Score} - 0.5) \times 2^{-\log_{10}(\text{Total Reviews} + 1)} \)
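The actual calculation in this project lived in a Tableau calculated field, but for anyone reproducing it in code, a pandas equivalent might look like this (the review-count column names are assumptions about the prepared dataset):

```python
import numpy as np

def adjusted_rating(positive, negative):
    """SteamDB-style adjusted rating: pull the raw score toward 50%,
    with the pull shrinking as the number of reviews grows."""
    total = positive + negative
    score = positive / total
    return score - (score - 0.5) * 2 ** (-np.log10(total + 1))

combined["adjusted_rating"] = adjusted_rating(combined["positive_ratings"],
                                              combined["negative_ratings"])
```

Plugging in the earlier example, the 48-to-1 game scores roughly 0.83 while the 48-to-11 game scores roughly 0.72, restoring the ordering a reader would expect.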

With the data cleaning and preparation complete, the first step in introducing a topic is to give the viewer a high-level overview of the subject.  Before any visualization was created, an introductory blurb was written to briefly describe Steam.  A visualization holds little weight if the viewer knows nothing about the topic at hand; providing even a modicum of background is akin to holding the door open for them, and it makes the data far easier to read.

However, as informative as a block of text can be, it is a double-edged sword: viewers may not be interested in reading a paragraph or two, no matter how short.  To entice the viewer, we opted to use simple text to encapsulate a single number, the average number of hours a gamer spends playing.  It pairs two things most people are familiar with: averages, and how long people play a game.  In her book “Storytelling with Data”, Cole Nussbaumer Knaflic emphasizes the beauty of simplicity in communicating information, but she also warns of the danger of reducing data to a single data point, because of everything that gets hidden away.  Admittedly, condensing the playtime of 27,000 games into one summary statistic is difficult, given the potential outliers skewing the data in either direction, but that can be mitigated with techniques such as weighted or trimmed means.
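As an illustration of that mitigation, a trimmed mean takes only a couple of lines; the playtime column name is an assumption about the prepared dataset.

```python
from scipy import stats

# Drop the most extreme 5% of values at each end before averaging, which blunts
# the influence of outlier titles with enormous (or near-zero) playtime.
playtime = combined["average_playtime"].dropna()
plain_mean   = playtime.mean()
trimmed_mean = stats.trim_mean(playtime, proportiontocut=0.05)
```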

As one might have surmised by this point, the theme of the first dashboard was simplicity.  The goal was to ease the viewer into absorbing high-level, generalized information about the data before delving into more granular topics.  The bar graph and line graph were chosen for their ubiquity in data visualization, while still surfacing interesting trends.  For example, the line graph showed a gentle slope in the number of games Steam released onto its platform each year, until 2013, when the number of games spiked.  This was thanks to its (now retired) Greenlight service, which was meant to help indie developers release their games onto Steam without being overshadowed by AAA developers.  Because of this, more indie developers began releasing their games onto the platform, crowding it at a greater rate than in previous years.

The last visual on the landing page is a treemap.  The decision to use one was inspired by Ben Shneiderman’s article, “Discovering Business Intelligence Using Treemap Visualizations”.  He argues that a treemap can convey complex hierarchical information while remaining simple enough for viewers to glean meaning at a glance from color and size.  Indeed, the original use for treemaps was to visualize the contents of hard disks holding tens of thousands of files across five to fifteen levels of directories.  The same applies to the Steam data: there are numerous levels of genres and sub-genres, which a treemap accommodates perfectly.  However, because the dashboard embraces simplicity, the decision was made to show only the main genre for each game; this is why the genre columns were split earlier during the data preparation phase.  The resulting treemap shows that people on Steam overwhelmingly prefer action games compared to the next most owned genre, adventure.  A detail that might get lost here is that the figure reflects total ownership across all games in a genre, not the number of distinct titles.  To capture that detail, we used tooltips to turn implied messages into explicit ones.

The next step was visualizing prices.  We focused on prices because price can deter people from purchasing a game, and we also wanted to explore the price ranges for games and applications on Steam, since the platform is known for its affordable titles.  Performing exploratory data analysis and generating a table of summary statistics was straightforward, but such a table would not be very compelling.  We wanted a visual that properly conveyed the price distribution, and boxplots were the answer, as they display both central tendency (the median and quartiles) and spread (the range and outliers).

However, a single boxplot covering every game on Steam would not paint a complete picture, so we split the boxplots by genre in the hope of providing more insight into how price ranges differ.  The resulting visual shows that most games fall within a range of $0.00 to $69.99, though there are a number of outliers.  The most expensive application was an educational one, made to help label makers identify which hazardous-material labels to place on vehicles.  We also found that certain application types, such as those for animation, modeling, design, and illustration, were more expensive than their gaming counterparts, owing to their commercial use.  Other interesting findings included the fact that not every title in the “Free to Play” genre was actually free to play.  These patterns would be much harder to uncover if the data were presented purely in tabular form.
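The dashboards themselves were built in Tableau, but an equivalent genre-by-genre boxplot can be sketched with seaborn; the column names follow the earlier preparation steps and are illustrative.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One boxplot of price per primary genre, using the combined DataFrame from earlier.
fig, ax = plt.subplots(figsize=(12, 6))
sns.boxplot(data=combined, x="genre_1", y="price", ax=ax)
ax.set_xlabel("Primary genre")
ax.set_ylabel("Price (USD)")
ax.tick_params(axis="x", labelrotation=45)
plt.tight_layout()
plt.show()
```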

With some of the basic details covered, we turned our focus to identifying whether there were relationships between variables in the data.  To do so, we used scatterplots to map two variables against each other.  There were two pairs we decided to examine: price and rating, and hours played and rating.  We wanted to see what factors influence a game’s rating, and price and hours played seemed like good candidates.  The rationale was that a more expensive game would presumably be of higher quality, because more time and resources were invested in its production.  Hours played is also a reasonable indicator: if a game is enjoyable, people will spend more time playing it than a game that is not.

The inspiration for these visualizations comes from the fact that identifying relationships in data is the crux of machine learning and much of data science.  One potential improvement in future iterations of this project would be to fit a model to these variables and use it to predict a game’s rating.

The resulting scatter plot showed a positive trend, with scores increasing relative to price, but the downside of the scatter plot was its readability.  Cutting out the outliers would have hidden important details, such as the most expensive games underperforming heavily relative to their price point.  Those underwhelming games reveal another angle to the data: expensive games tend to attract more negative reviews when the quality does not match expectations.  In addition, the number of free games also pulled on the trend line running through the data points.  Filtering out free games strengthened the positive trend, suggesting that, generally, paid games tend to be of higher quality.  Another idea for future iterations would be to add a slider to see how the trend line shifts as we adjust the price cutoff.
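A rough seaborn sketch of that comparison, fitting a simple linear trend with and without free titles (again using the illustrative column names from earlier):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter of adjusted rating against price, with a linear fit,
# once for all games and once excluding free titles.
paid = combined[combined["price"] > 0]

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
sns.regplot(data=combined, x="price", y="adjusted_rating",
            ax=axes[0], scatter_kws={"alpha": 0.2})
axes[0].set_title("All games")
sns.regplot(data=paid, x="price", y="adjusted_rating",
            ax=axes[1], scatter_kws={"alpha": 0.2})
axes[1].set_title("Paid games only")
plt.tight_layout()
plt.show()
```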

A similar trend, and a similar issue, followed for the scatter plot exploring the relationship between hours played and rating.  The overwhelming majority of data points fell below 10,000 minutes played, the equivalent of nearly seven days.  The question remained whether scaling the data down was worth losing the outliers.  The cluster of points below the 10,000-minute mark implies that games tend to be rated higher the more they are played, but only up to a point.  This makes sense: gameplay can get stale, or a newer, more interesting game comes out and the first one gets sidelined.

Finally, we arrive at the last series in the story.  Because the dataset was an aggregation of data collected over Steam’s lifetime, we decided it was only fitting to treat it as a time series.  In a blog post on time series data, Eugenia Moreno and Brian Sheehan explain that time series data is rich in information, allowing users to identify changes in behavior over time and to compare different sets of data over the same time frame.  Regression models, similar to the one discussed earlier, could even be applied to help forecast future events.

While forecasting is not an option given the limitations of the dataset (one row equals one distinct game and carries no information to forecast from), we wanted to examine the same two aspects, rating and playtime, and see how they changed over the years.  A line graph was the natural choice to display the information.  For the playtime graph, median playtime was used, as the median tends to be more resistant to outliers than the average for an individual game; we then averaged the per-game medians for each year.  Another option would have been to show one line for average playtime and another for the median.  To accommodate that, we would have needed a filter showing only paid games as opposed to free games; otherwise four lines would appear on the graph, which creates clutter even with a legend to differentiate the sets.
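The yearly aggregation behind that line graph can be sketched as a simple group-by; the release-date and playtime column names are assumptions about the prepared dataset.

```python
import pandas as pd

# For each release year, take the mean of the per-game median playtime.
combined["release_year"] = pd.to_datetime(combined["release_date"]).dt.year
yearly_playtime = (combined.groupby("release_year")["median_playtime"]
                           .mean()
                           .reset_index(name="avg_of_median_playtime"))
```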

Unfortunately, aggregating all the data into a single point per year meant that a lot of detail was lost in translation.  To give an idea of which titles contributed to each point, we added a table of the top ten games for that year, with a dropdown to cycle through the years.  The general trend revealed that free-to-play games saw more playtime than paid games, although in 2017 and 2018 the two patterns evened out.

The same thought process was applied to ratings over time.  Surprisingly, we saw a decline in game ratings over the years.  This could be ascribed to a number of factors, such as a flood of sequels with unoriginal gameplay, or gamers holding games to higher standards.  That question is beyond the scope of this project, although it would be interesting to carry these findings over to future work when new data becomes available.

In retrospect, this visualization project was a journey outside my usual comfort zone.  As an employee of New York City, my usual work consists of compact, neatly outlined dashboards built primarily from contingency tables and simple but effective graphs such as line and bar charts.  There was some reluctance in reaching for treemaps, scatter plots, and boxplots, because I was wary of making visually compelling graphs at the cost of losing the message I wished to convey.  I believe the results were mixed.  While the treemap and boxplots served their purpose, I was unable to achieve the same degree of success with the scatterplots because of scaling and scope issues.  In addition, there were additional details, such as p-values and r-squared values, that went beyond my current level of expertise; something I will have to revisit as I learn more about statistics, hypothesis testing, and how they relate to data analytics and visualization as a whole.
