Visualizing Steam: An Overview

(Tableau dashboard embedded below post)

What factors help determine whether a video game is popular on the online distributor Steam?  As one of the largest digital distribution services for video games, Steam is home to thousands of games, ranging from indie titles made by a few developers to games published by industry giants such as Activision-Blizzard or EA Games.  As someone who plays games on a regular basis, I have pondered this question for a while now. Is it the fact that games are fairly cheap? Or are games on Steam fun to play, and thus highly rated? Or is it something else altogether?

Because there are thousands of games available on the service, picking out a game can be quite a daunting task.  Gamers might wonder: what makes a game so popular?  Is it the gameplay?  Or is it the graphics?  Or were players lured in by something not directly related to the gameplay, such as the price point or reviews?  Whatever a game’s main selling point is, we aim to look into the factors that influence its allure so that readers can make better-informed decisions when using the service.  Since we will be analyzing many facets of the games, we will explore different categories such as genre, reviews, year of publication, price point, and so on.  There will also be tooltips to help people who are not familiar with the gaming industry navigate some of the common jargon and acronyms, such as “RPG”, “MMO”, and “F2P”.

The visualizations used in the dashboards are a mix of standard charts such as bar graphs, line graphs, and tables. They help to quantify amounts, show changes over time, and tell a story. There are also box plots and scatterplots, because we want to explore the distributions of, and relationships between, attributes of the games sold on Steam. To make the visualizations easier to interpret (especially since the dataset is large, at over 27,000 rows), there are captions to help explain the data. Although I’m a huge proponent of the idea that simple is best, there are times when a more complex visualization is the most effective way to portray the story you want to tell.

The data was pulled from Kaggle, courtesy of the user nikdavis.  It comes as three datasets, each with a different use, ranging from information about the video game itself (e.g., developer/publisher, size, price) to user-based information such as time played and reviews.  The data was fairly clean and required minimal cleanup, such as converting certain columns and creating calculations like average prices and scores.

Were there limitations to the data? Absolutely. While the dataset was one of the most detailed datasets on the platform itself, it stops at the year 2019. I had hoped to obtain data about games from the year 2020, as the COVID-19 pandemic would have generated more interesting data, especially about games such as “Among Us” or “Fall Guys”. These games flourished in popularity and sales in 2020, when people were forced to stay in as a result of lockdowns.

Another limitation of the data was the lack of precise statistics. The number of owners was originally given as a range; for example, the data displayed a game’s number of owners as ‘10000..20000’. I understand why capturing an exact number is difficult: people return games all the time, especially since Steam has a fairly generous return policy if a customer isn’t satisfied. (A column with the number of times each game was returned would have let me make another visualization, had the data been there.) While splitting the column into ‘10000’ and ‘20000’ on the delimiter was straightforward, the more difficult part was deciding which of the two numbers to use, or whether I should use their mean. In the end, I decided to use the lower end of each range because it was the safer choice, although the risk was that underestimating the numbers would make the results less compelling.
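As a rough illustration of that choice (a minimal Python sketch, not the actual steps used for the dashboard), the range can be split on its delimiter and reduced to a single estimate. The function names, the `strategy` parameter, and the sample values below are hypothetical and only assume the ‘10000..20000’ format described above.

```python
def parse_owner_range(text, delimiter=".."):
    """Split an owners range such as '10000..20000' into (low, high) integers."""
    low, high = text.split(delimiter)
    return int(low), int(high)


def estimate_owners(text, strategy="lower"):
    """Reduce an owners range to one number.

    strategy="lower" keeps the conservative lower bound;
    strategy="mean" uses the midpoint of the range instead.
    """
    low, high = parse_owner_range(text)
    if strategy == "lower":
        return low
    if strategy == "mean":
        return (low + high) / 2
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical sample values in the format described above.
sample_ranges = ["10000..20000", "20000..50000"]

for text in sample_ranges:
    lower = estimate_owners(text, "lower")
    midpoint = estimate_owners(text, "mean")
    print(f"{text}: lower bound = {lower}, midpoint = {midpoint}")
```

Comparing the two columns side by side shows how much the lower-bound choice understates ownership relative to the midpoint, and the gap grows as the ranges get wider.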

Having finished the first iteration of this project, where could I take it next? There’s still a wealth of information in the dataset; for example, I could dive deeper into average playtimes and the pricing for each game. For more advanced applications as I go further into my data science career, I could build a machine learning model to predict the ratings of a new Steam game based on factors like genre, publisher, and pricing. The possibilities are numerous.
