Imagine if a big Hollywood film turned people away on its opening weekend due to technical problems. Now imagine it happened so often that it wasn’t even a surprise anymore. That’s the situation in the game industry, where it has become increasingly common over the last ten years for games to have major launch problems that degrade the gameplay experience or even render them completely unplayable.
For players, this is incredibly frustrating. They’ve just spent their hard-earned cash buying a game, excited to play it for the first time, and … it doesn’t even work. For game developers, this is a customer support nightmare and a PR disaster. It’s a major threat to their business.
So why does it keep happening? It comes down to the fact that most games are now online services, with updated content, special events, virtual economies, player interaction, etc. All of these things require backend services — services that need to be built and thoroughly tested before launch, and that need to run reliably from that point on. Done right, a game’s backend services provide for a great customer experience and make possible the kinds of live game operations techniques that foster an engaged pool of players, happy to spend their money long after the initial release. Done wrong, they can absolutely kill a game.
In 2012, EA was having so many server problems after releasing The Simpsons: Tapped Out that it pulled the game from the app stores. After fully redesigning the backend, EA released the game again 5 months later. Many game studios would not have been able to afford an event like that, but EA was able to see the changes through, and the game became a huge and enduring hit after returning to the market.
If backend services are so important to the success of games, then why do so many games ship with backends that are not up to the task? After all, pretty much any consumer product needs backend services these days. Facebook, Amazon and Twitter aren’t going down constantly (well, Twitter was for a while in the early days, but they’ve got it figured out now). And when you take a look at some of the notable games with launch problems — Batman: Arkham Knight this year, Sim City, Diablo III, The Simpsons: Tapped Out in previous years — these are big budget productions by experienced, capable teams.
The answer lies with two core facts: Most game studios are still not geared for online services, and games face backend challenges that are different than those faced by traditional online services. Just as a lot of game developers don’t understand all the issues of an online service, backend engineers often don’t understand all the issues of online games.
Game development is a different beast
Many of the smartest and most creative programmers I’ve ever worked with are game developers. They are really, really good at getting a machine to do incredible things. The math they use routinely for things like graphics rendering, physics and AI tends to make my head explode. But they just aren’t experienced in designing and building online services or in keeping them running 24×7, and they aren’t always in tune with many of the software engineering best practices that most service engineers take for granted.
Take testing. Traditional gameplay testing — play the game a bunch of times each day and make note of anything that seems broken — is nothing like the kind of functional and load testing needed to verify that a service will behave correctly at scale. The final phase of testing for games is usually a private beta of several thousand players who have agreed to report bugs and provide feedback on gameplay in exchange for a sneak peek at the game. That works for flushing out hardware-specific bugs and game imbalances, but ignores the issue of load. It’s no surprise then when things fall over after launch, when traffic reaches 100x or more the highest level it had ever been tested at. Load testing is absolutely critical to building reliable services.
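To make the contrast concrete, here is a minimal sketch of what even a basic load test looks like: fire many concurrent requests at the service and measure latency percentiles and error rates. The `call_service` stub and all the numbers are hypothetical stand-ins for a real client hitting a real backend.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def call_service():
    """Stand-in for a real request to a game backend (hypothetical)."""
    time.sleep(random.uniform(0.001, 0.005))  # simulated network + processing time
    return 200  # HTTP-style status code

def load_test(num_requests=1000, concurrency=50):
    """Fire many concurrent requests; report p95 latency and error count."""
    latencies = []
    errors = 0

    def one_request():
        nonlocal errors
        start = time.perf_counter()
        status = call_service()
        latencies.append(time.perf_counter() - start)
        if status >= 500:
            errors += 1

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(num_requests):
            pool.submit(one_request)
        # exiting the 'with' block waits for all requests to finish

    latencies.sort()
    p95 = latencies[int(0.95 * len(latencies))]
    return {"requests": len(latencies), "p95_seconds": p95, "errors": errors}
```

A real load test would of course run from many machines and ramp traffic well past the expected launch-day peak; the point is that this kind of measurement is a different discipline from playtesting.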
Game developers are also still getting used to the concept of continuous operations. The traditional culture of a game studio embraces periods of crunch time in order to crank out features before a release deadline, but finishing up a backend implementation right before launch when low on sleep is a recipe for disaster. Moreover, there isn’t always a sense of urgency when something breaks after hours. They now have these complicated, fragile backend systems in place which are understood by only a handful of people in their company, yet they don’t have sufficient monitoring to detect when something does break or the structure in place to ensure that there’s always someone available to fix it.
Games have unique backend challenges
The second, and perhaps most important, reason that games have so many problems with their backends is that games share some characteristics and requirements that make building services for them particularly challenging — in some ways more challenging than for other kinds of consumer online services such as e-commerce or social networks.
To take just one example, think about how online services grow. As big as Facebook is today, it didn’t get that way overnight. While their growth rate has been incredibly high, it has been at least somewhat predictable. Facebook’s engineers successfully predicted the points at which their future traffic loads would overwhelm their current systems, and they re-designed and updated their services to stay ahead of the curve.
With games, on the other hand, it’s common to have your peak player counts and load on the very first day of release. That means that a game’s backend services will be stressed to the maximum before there’s been any time to observe how they perform under realistic load and to flush out bugs. It’s incredibly hard to predict what that level of traffic will be, particularly for mobile games, where launch-day player counts depend heavily on app store placement: a million or more for an “Editor’s Pick,” a couple hundred thousand for “Featured,” and barely any if not featured at all.
Dealing with unpredictable load
The solution to the problem of unpredictable load is building services that scale elastically. This is a pretty well understood and widely used approach, but with games, it’s particularly important to make sure that both computing resources and data access throughput are elastic.
Games are very demanding with data. The write loads are high relative to read loads, because as players interact with the game, they are constantly generating data that must be stored reliably: statistics for things like points and levels, virtual currency balances, virtual goods inventories and game save state are all changing all the time. Read loads are also high, and in many cases they must be strongly consistent. For example, if a request for a virtual currency balance returns a stale value, that could cause big problems. Players are also very observant, and will notice and complain about any inconsistencies. This rules out the common scaling technique of returning cached results for read data.
The requirements of elastic read and write throughput, strongly consistent data and reliable updates led us at PlayFab to choose DynamoDB from Amazon Web Services (AWS) as our primary data store, a choice we’ve been very happy with. Its main distinguishing feature is that it can scale to arbitrary throughput levels, in terms of read and write operations per second, and the costs are linear with the throughput levels you choose. This is amazing, as it eliminates what is often the ultimate bottleneck in a service’s scalability — its database. It comes with a trade-off, though: it offers extremely limited functionality compared to a traditional relational database. It does not support multi-item transactions, table joins, or foreign key constraints, and it only supports consistent read queries on a table’s primary key.
The lack of transactions in particular is problematic. Consider the common scenario of purchasing a virtual good with a virtual currency. The logical steps are:
- Decrement the player’s virtual currency balance
- Instantiate virtual good item in the player’s inventory
What happens if the second operation fails? In a relational database, you would wrap both operations in a transaction, but with DynamoDB you can’t. The player would be charged for the item but not receive it: not acceptable.
To work around this limitation, we’ve effectively rolled our own transactions by changing the logical steps to the following:
- Write a message describing the operation to a durable queue
- Instantiate the virtual good item in the player’s inventory
- Decrement the player’s virtual currency balance
- A few seconds later, read the message from the queue and repair the results if there was a failure in steps 1-3
This way, in the worst case, the player receives the item in their inventory and simply isn’t charged immediately. In practice, this has proven to occur very rarely and has been an acceptable tradeoff.
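The steps above can be sketched as follows. This is an illustrative reconstruction of the pattern, not PlayFab’s actual code: the tables and queue are in-memory stand-ins for DynamoDB tables and a durable queue, and all the names are hypothetical.

```python
import uuid

# In-memory stand-ins for DynamoDB tables and a durable queue (illustrative).
balances = {"player1": 100}
inventories = {"player1": []}
pending_queue = []

def purchase_item(player_id, item_id, price):
    """Steps 1-3: record intent, grant the item, then charge the player."""
    op = {"id": str(uuid.uuid4()), "player": player_id, "item": item_id,
          "price": price, "granted": False, "charged": False}
    pending_queue.append(op)             # step 1: durable record of intent
    inventories[player_id].append(item_id)
    op["granted"] = True                 # step 2: grant the item
    balances[player_id] -= price
    op["charged"] = True                 # step 3: charge the player
    return op

def repair_worker():
    """Step 4: a few seconds later, inspect queued operations and repair
    any that failed partway through."""
    while pending_queue:
        op = pending_queue.pop(0)
        if op["granted"] and not op["charged"]:
            # Item was granted but the charge never landed: charge now
            # (an alternative policy would be to revoke the item instead).
            balances[op["player"]] -= op["price"]
            op["charged"] = True
        # If both steps completed, there is nothing to repair; if neither
        # happened, the purchase never took effect and can be retried.
```

The key property is that the intent is written durably *before* any state changes, so the repair worker always has enough information to drive a half-finished purchase to a consistent end state.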
DynamoDB does not automatically adjust read and write throughput capacity based on load. Early on, we were bitten by this when one of our games went on sale and the traffic spike caused database throttling. Now, we run a script that monitors the current read and write loads and adjusts the throughput limits up and down appropriately every few minutes, allowing for some headroom.
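The core decision that script makes can be sketched as a pure function: pick a new provisioned capacity from the observed request rate, with headroom. This is illustrative only — the real script also calls the DynamoDB API to apply the change, scales reads and writes separately, and respects the service’s limits on how often capacity can be decreased. The specific numbers are assumptions.

```python
def next_provisioned_capacity(observed_rate, current_capacity,
                              headroom=1.5, floor=50):
    """Choose a new provisioned throughput (ops/sec) from the observed load.

    observed_rate: recent consumed ops/sec; current_capacity: what is
    provisioned now. Returns the capacity to request for the next window.
    """
    # Provision above the observed load to absorb short spikes.
    target = max(floor, int(observed_rate * headroom))
    # DynamoDB (at the time) allowed at most doubling capacity per update,
    # so large ramps take several adjustment cycles.
    if target > current_capacity * 2:
        target = current_capacity * 2
    return target
```

Running this every few minutes lets capacity track load in both directions, so a sudden sale-driven spike ratchets capacity up over a couple of cycles instead of causing throttling.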
Game servers: Low latency, high capacity
Elastically scaling compute resources falls into a couple of categories for us. First, we have a core set of web services for things like player authentication, inventory management, leaderboards, matchmaking etc. These services run on AWS EC2 instances behind Elastic Load Balancers configured for auto-scaling based on CPU utilization. This configuration is pretty much standard.
The other category is more specific to games. Many synchronous multiplayer games, meaning games where people play against each other in real-time, make use of game servers. A game server is a process that the players’ game clients connect to for the duration of the session or match. It provides low-latency channels for messages to travel between players, where a message is something like “player’s position changed” or “player cast spell.” These connections are highly sensitive to latency — more than a few tens of milliseconds causes noticeable effects like stuttering. For this reason, we try to connect players to game servers in the same region.
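One simple way to do that, sketched below under assumed inputs: have each game client ping an endpoint in every candidate region, then pick the region with the lowest round-trip time — or, for a match with several players, the region that minimizes the worst latency any player would see. The function names and data shapes are illustrative, not our actual matchmaking API.

```python
def pick_region(player_pings):
    """player_pings: mapping of region name -> measured RTT in milliseconds.
    Return the region closest to this one player."""
    return min(player_pings, key=player_pings.get)

def pick_match_region(players):
    """players: list of per-player ping dicts (region -> RTT in ms).
    Return the region that minimizes the worst latency in the match."""
    # Only consider regions every player was able to measure.
    candidates = set.intersection(*(set(p) for p in players))
    return min(candidates, key=lambda region: max(p[region] for p in players))
```

Note that the best region for a match is not necessarily any individual player’s best region — minimizing the worst case keeps the experience acceptable for everyone.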
Another use for game servers is to enforce the rules of the game and detect cheating. For example, if a game client claims that the player defied the game’s loose interpretation of the laws of physics by jumping too high or shooting a bullet through a wall, then the game server can detect it as cheating and take appropriate action. Doing this involves the game server performing some very complex operations like physics simulation and collision detection. These operations are computationally expensive, so a popular game with a high number of concurrent players in matches requires a significant number of machines to host its game servers.
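The cheapest form of this kind of server-side validation doesn’t even need full physics simulation: the server can bound-check each client-reported move against the game’s movement limits. A minimal sketch, with made-up game constants and a simplified coordinate convention (y is height):

```python
MAX_SPEED = 10.0       # metres/second; illustrative game constant
MAX_JUMP_HEIGHT = 2.5  # metres gained in one update; illustrative

def validate_move(prev_pos, new_pos, dt):
    """Server-side sanity check on a client-reported move.

    prev_pos/new_pos are (x, y, z) tuples; dt is elapsed time in seconds.
    Returns False if the move is impossible under the game's movement rules.
    """
    dx = new_pos[0] - prev_pos[0]
    dz = new_pos[2] - prev_pos[2]
    horizontal = (dx * dx + dz * dz) ** 0.5
    if horizontal > MAX_SPEED * dt:
        return False  # moved too far too fast: teleport or speed hack
    if new_pos[1] - prev_pos[1] > MAX_JUMP_HEIGHT:
        return False  # gained too much height in one step: jump hack
    return True
```

Full server-side physics and collision detection (shooting through walls, and so on) is what makes game servers computationally expensive; checks like this are just the first line of defense.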
To satisfy the goals of low latency and high capacity, we have clusters of machines in many regions around the world. Running all these machines is expensive, so we’ve optimized the costs by provisioning a set of dedicated bare metal servers to handle each game’s baseline load. These servers are significantly less expensive on a monthly basis than EC2 instances. When demand exceeds the capacity of these machines, we have an auto-scaling service that starts up EC2 instances in the same regions to handle the extra load.
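The sizing logic behind that hybrid setup reduces to a simple calculation: the dedicated machines absorb the baseline, and anything above their capacity spills over to on-demand instances. A sketch under assumed numbers (servers-per-machine and the function itself are illustrative, not our production sizing):

```python
SERVERS_PER_MACHINE = 20  # game server processes one host can run; illustrative

def overflow_instances_needed(active_matches, baseline_machines,
                              servers_per_machine=SERVERS_PER_MACHINE):
    """How many EC2 instances to run beyond the dedicated baseline.

    Dedicated bare-metal machines handle the steady load cheaply; excess
    demand is rounded up to whole on-demand instances.
    """
    baseline_capacity = baseline_machines * servers_per_machine
    excess = active_matches - baseline_capacity
    if excess <= 0:
        return 0
    return -(-excess // servers_per_machine)  # ceiling division
```

The economics work because the baseline runs on cheaper monthly hardware, while the auto-scaler only pays EC2 rates for the unpredictable portion of the load.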
Because there can be rapid increases in demand for these game servers, it’s very important to be able to add new instances quickly when this happens. A technique that we’ve used to help with this is to pre-bake everything the game server needs into an AMI (Amazon Machine Image) rather than run a configuration script on it when it boots up.
Making it all work: The Loadout experience
The chart below shows the daily active users of Loadout, a shooter that launched on our platform in early 2014. Based on the estimates of its developer, Edge of Reality, we pre-provisioned 10 dedicated machines to host its game servers. In the days leading up to the public launch, Edge of Reality ratcheted up its private beta to a few hundred people in preparation.
About two hours after it launched, with prominent placement in the Steam storefront, there were over 30,000 players in matches running on over 200 machines. The only reason the game stayed up is that we had the elastic game server capacity up and running.
The lessons I learned with Loadout, like the lessons I learned in my years at Uber Entertainment building the backend for games such as Planetary Annihilation, have all of course gone into making the PlayFab platform even stronger and more stable.
That means our customers can benefit from our past mistakes instead of making their own. In a world of dedicated backend engineers focused exclusively on games, launch problems will someday seem as antiquated as blowing on game cartridges.
Matt Augustine is the CTO and co-founder of PlayFab. This article is adapted from a talk he gave at the Aspen Software Leadership Summit last month.