An application's infrastructure is typically designed around its needs. In an ideal world (and most blog posts), those needs are outlined clearly before any work starts, and everything scales smoothly and automagically. In the real world, however, it almost never plays out that way. It didn't for my game Elethor.
Elethor is an idle RPG PBBG set in an alternate future. Players log in occasionally to craft items, upgrade their items, interact with the market, chat with other players, etc. Typical PBBG activities. The actions in Elethor are server-side and tick-based. Every 6 seconds, players with active actions fight monsters and get drops. Every 60 seconds (on average), players gather ore from resource nodes. This is also relatively typical, although some games run their actions as HTTP requests, meaning players need to be logged in with an active tab for their actions to tick.
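To make "tick-based" a bit more concrete, here's a stripped-down sketch of one way to drive those ticks: a queued job that processes a single player's fight and then re-queues itself 6 seconds out. The job name and structure here are illustrative, not the actual game code.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;

// Hypothetical sketch of a self-rescheduling combat tick.
class ProcessCombatTick implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public function __construct(public int $playerId)
    {
    }

    public function handle(): void
    {
        // Resolve the fight, roll drops, persist the results (details omitted).

        // Re-queue the next tick 6 seconds out, as long as the player
        // still has an active action.
        self::dispatch($this->playerId)
            ->onQueue('fights')
            ->delay(now()->addSeconds(6));
    }
}
```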
I am the solo dev for Elethor, a passion project that has gotten more traction than I anticipated. It also broke in ways I hadn't anticipated. As a full-stack developer at my day job I dabble in a bit of everything, but devops and infrastructure are easily my weakest points. So take this all with a grain of salt. Or two. I'm learning as I go.
This is my infrastructure story. So far.
Phase 1
Phase 1 started as it probably does for most projects.
All the game code ran on one server, and the database lived on another server. DigitalOcean is my provider, and I utilize their managed database servers. As I mentioned earlier I'm not a devops guy, so the less I have to worry about, especially with crucial game data, the better.
The services include:
- Postgres as the database (managed by DO).
- PHP (Laravel), which the back-end is written in.
- A NodeJS server for websockets. I receive data from the client via HTTP requests, but use websockets to push events to the client (sketched below).
- Laravel Horizon, a queue processing service to handle queueing and executing actions.
- Redis for caching and as storage for the queue processing service and notification system.
- VueJS as the front end.
It's a bit convoluted, but the combined components perform well without major bottlenecks.
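One detail worth calling out: the PHP back-end never holds a websocket open itself. The exact wiring isn't important here, but a common pattern with this stack is to publish events to Redis and let the NodeJS process relay them to connected clients over the websocket. Something like this on the PHP side (the channel and event names are made up for illustration):

```php
<?php

use Illuminate\Support\Facades\Redis;

// Hypothetical helper: push a server-side event toward a player's browser.
// The NodeJS process subscribes to these Redis channels and forwards the
// payload over the websocket to whoever is connected on that channel.
function notifyPlayer(int $playerId, string $event, array $payload): void
{
    Redis::publish("player.{$playerId}", json_encode([
        'event' => $event,
        'data'  => $payload,
    ]));
}

// e.g. after a fight resolves:
// notifyPlayer($player->id, 'fight.completed', ['drops' => $drops]);
```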
Deployment
Luckily I had experience using Deployer, so integrating it with Elethor was a breeze. This automated a majority of my deployment tasks into one script, making it impossible to make silly mistakes (like not invalidating a cache, or forgetting to cache-bust assets). Manual deployments should be avoided at all costs.
Database vs Cache
The database is the primary source of truth for all data. The cache holds information that changes infrequently (or only in very specific instances) and is expensive to assemble in the first place.
For example, players can equip up to 16 different pieces of gear. Each piece has a tier, reinforcements, and energizements, and all of these numbers feed into stats that are recalculated for every single fight. That data is spread across several tables per item, making for heavy queries, especially when they run every 6 seconds per player.
However, the equipment state only changes when a player equips / unequips / upgrades an item. Therefore I store the player's equipment in the cache, and invalidate it on those events. This cuts the number of database queries by an extraordinary amount, and hitting the Redis server once only costs RAM to store it and a bit of network lag.
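Here's a minimal sketch of that cache-then-invalidate pattern. The key name and the query helper are illustrative, not the actual game code:

```php
<?php

use Illuminate\Support\Facades\Cache;

// Build the equipment stats once and keep them until something changes.
function getEquipmentStats(int $playerId): array
{
    return Cache::rememberForever("player:{$playerId}:equipment", function () use ($playerId) {
        // The expensive part: join gear, tiers, reinforcements, and
        // energizements across several tables and collapse them into
        // a flat stats array. (Hypothetical helper.)
        return loadEquipmentStatsFromDatabase($playerId);
    });
}

// Called whenever a player equips, unequips, or upgrades an item.
function invalidateEquipmentCache(int $playerId): void
{
    Cache::forget("player:{$playerId}:equipment");
}
```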
Market data, on the other hand, is an example of something not to cache: it's volatile and constantly changing. It's better to optimize the queries and fetch a fresh copy when needed than to continually invalidate the cache.
But it wasn't enough...
With everything but the database running on the same server, things quickly bottlenecked. Even after doubling my server size twice within a few hours after launch, there were still lag problems.
Phase 2
Optimizing the existing combat actions was the next step. After closely analyzing my code I found I was making almost 50 database queries per combat action. Too many.
Some of these were N+1 queries that needed to be optimized. Others were values I was fetching in multiple places, so I created a new class that loads the data once and serves it from memory rather than hitting the database each time. I was also writing updates to the database at several points in a job, so I refactored to manipulate the data inside the class and commit the final state to the database in one go.
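Roughly, the shape of that refactor looks like this. The class, model, and column names are invented for illustration:

```php
<?php

// Simplified sketch of the "load once, mutate in memory, save once" refactor.
class CombatState
{
    public function __construct(
        private \App\Models\Player $player, // hypothetical Eloquent model, loaded once per job
        private int $health,
        private int $experience,
    ) {
    }

    public static function load(int $playerId): self
    {
        $player = \App\Models\Player::findOrFail($playerId); // one query
        return new self($player, $player->health, $player->experience);
    }

    // All of the fight logic mutates the in-memory copy...
    public function applyDamage(int $amount): void
    {
        $this->health = max(0, $this->health - $amount);
    }

    public function addExperience(int $amount): void
    {
        $this->experience += $amount;
    }

    // ...and only the final result hits the database, in a single update.
    public function commit(): void
    {
        $this->player->update([
            'health'     => $this->health,
            'experience' => $this->experience,
        ]);
    }
}
```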
I managed to get the number of database queries down to 8 per job, and only 3 of those were updating existing data. This got my action processing time from 250-300ms down to 120-150ms. Still long, but significantly faster.
Phase 3
Separating out services is the best way to horizontally scale an application. Luckily for me I already utilize Laravel Horizon for action queueing, which lends itself nicely to being split off.
I created another server (we'll call it Server#2) to process fighting actions (the ones that run every 6 seconds) exclusively. The other queued jobs would still be handled on the primary server.
Splitting out the fighting actions' consistently heavy load keeps the primary server at around 50% of max a majority of the time, only spiking during patch releases or peak hours.
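Horizon makes that kind of split mostly a configuration exercise: each server runs its own supervisor, and the one on Server#2 consumes only the fight queue. One way to lay that out is to give each server its own environment name (all names, process counts, and option keys below are illustrative, and the exact keys vary a bit between Horizon versions):

```php
<?php

// config/horizon.php (excerpt)
return [
    'environments' => [
        // The primary server runs with APP_ENV=production and handles
        // everything except fights.
        'production' => [
            'supervisor-default' => [
                'connection' => 'redis',
                'queue'      => ['default'],
                'balance'    => 'auto',
                'processes'  => 10,
                'tries'      => 1,
            ],
        ],

        // Server#2 runs with APP_ENV=production-fights and consumes only
        // the queue that combat jobs are dispatched onto.
        'production-fights' => [
            'supervisor-fights' => [
                'connection' => 'redis',
                'queue'      => ['fights'],
                'balance'    => 'auto',
                'processes'  => 20,
                'tries'      => 1,
            ],
        ],
    ],
];
```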
Currently the main problem on Server#2 is the network latency of reaching out to the Redis server for cached data. It's still significantly faster than hitting the DB directly, but I hit it often enough during each fight action that the network time is the primary slowdown.
Ideally I would fix this by grouping my Redis calls together up front rather than spreading them throughout the code, but that will require a major refactor.
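Concretely, "grouping my Redis calls" would mean fetching everything a fight needs in a single round trip instead of one cache call per value scattered through the code. Something along these lines (the keys are illustrative):

```php
<?php

use Illuminate\Support\Facades\Cache;

$playerId = 1;   // example IDs
$monsterId = 7;

// Instead of Cache::get() calls sprinkled through the fight code (one
// network round trip each), pull everything up front in one call.
$keys = [
    "player:{$playerId}:equipment",
    "player:{$playerId}:buffs",
    "monster:{$monsterId}:stats",
];

$cached = Cache::many($keys); // a single MGET-style round trip to Redis

$equipment = $cached["player:{$playerId}:equipment"];
$buffs     = $cached["player:{$playerId}:buffs"];
$monster   = $cached["monster:{$monsterId}:stats"];
```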
It's things like this that I can't imagine ever anticipating while developing this game in beta with 30-40 active players.
Phase 4
I'm not here yet, but I have plans for how to improve the infrastructure. Combat actions, my primary server load, can now be scaled horizontally with ease. The trade-off is that with more servers, upgrades (and code deployments) get a bit trickier to manage.
The biggest thing I'll tackle is most likely the network traffic of requesting resources from the cache and the database. That might mean setting up a Redis cluster, or it might mean lumping my Redis requests together; I'm not sure yet. But I'm confident that as soon as I scale the servers and do some marketing, new problems will show up as more players arrive.
Questions I Asked
Serverless?
Serverless is great for quickly scaling up an application and handling unpredictable server loads.
Serverless can also be incredibly expensive for maintaining that scale.
One of the benefits of running an idle game is that the load is incredibly predictable. I know players can only perform one combat action every 6 seconds, and those actions run around the clock. My server usage is extremely level for days at a time.
The server I run the fighting actions on costs me $80/month. It processes anywhere from 80-120 "actions" per second.
Processing the same actions (again, just combat actions) on AWS Lambda (a popular serverless platform) would mean roughly 210 million invocations per month on average. That would cost ~$2,000/month on the cheap side.
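For the curious, the back-of-the-envelope math behind that number: at roughly 80 actions per second sustained, that's 80 × 86,400 seconds/day × 30 days ≈ 207 million invocations a month, and that's before per-invocation duration charges at 120-150ms each.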
I opted to use a dedicated server.
Kubernetes?
My dev environment is all Docker containers and it's a joy to work with. My experience with Docker pretty much ends there, and I'd rather spend my time developing new features for the game than altering the server infrastructure with tools I'm unfamiliar with.
Not to say I'll never go with Kubernetes, but it's not an active plan I'm pursuing.
Closing Thoughts
This has easily been the best learning experience (as far as code goes) of my entire life. Being thrown into the fire with hundreds of players screaming when buttons take too long to click forces you to learn and adapt quickly.
It's also been incredibly difficult. I'm reaching the point where posts about "scaling architecture" on Medium don't cut it anymore. I'm not scaling a brand new application that I can design however I want. I can't just throw "Kubernetes" at the problem and fix it. Elethor is a real-life application with real users that needs 100% uptime. Taking it down for a week to change the way the servers work isn't an option.
That said, you can only plan so far. Even with a month's notice before launch, I would never have anticipated the problems I've faced, nor had the foresight to prepare for them. Premature optimization is a real problem, and I'd rather have a shipped product that needs fixing than a shiny prototype that hasn't seen the light of day.
Your Questions
What did I miss?
What did I gloss over too quickly that you'd like to hear more about?
Leave a comment and let's start a discussion! Chances are you're not the only person wondering, so give back to the community a bit and ask your question!