Parsing 50k Games of Dota 2
For the last few weeks I’ve been on a mission to re-parse all the Dota 2 professional matches that have been played since Source 2 came out (just after TI5). This little endeavor has provided a good backdrop to talk about some of the behind-the-scenes work at datdota.
Why?
Dota 2 has changed a lot over the years. Some things have expanded in predictable ways: new items, new abilities, and new heroes. These are handled easily and immediately as patches come out. Other changes have happened in more unexpected ways: outposts, shrines, watchers, tormentors, talent trees, Aghanim's Shards; the list goes on. As the game has evolved, new statistics have become interesting to track. However, some features have come and gone so quickly that it can be safer to wait and see whether something sticks around before committing to tracking it. This means there have been times when mildly interesting stats were never tracked, or when tracking only began from some point in time.
Bugs have also existed in the parser at times: either because of logic mistakes (it just so happens that you shouldn't count Lone Druid's Bear as a hero kill in teamfights!), or outright mistakes in how Valve calculates and/or displays something (five Meepos dying at the same time is only a single death!). So for the purposes of accuracy and consistency, it made sense to patch up the issues and do a full re-parse.
How?
There are two main sources of Dota 2 data. Firstly, the Valve WebAPI provides a tiny amount of top-level data on each match, and allows finding the matches associated with tournaments (leagues). Secondly, there are replay files: protobuf-encoded binary files which contain a full ledger of the events that occur in the game. These are large (~120MB when uncompressed) and complete (almost every action that occurs in a game is included) server-side recorded files.
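As a rough illustration, here's a minimal sketch of pulling a league's match list from the WebAPI's GetMatchHistory endpoint. The league ID is a placeholder, and pagination (via start_at_match_id) is omitted for brevity:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LeagueMatches {
    public static void main(String[] args) throws Exception {
        String key = System.getenv("STEAM_API_KEY"); // your WebAPI key
        long leagueId = 15728;                       // placeholder league ID
        URI uri = URI.create(
            "https://api.steampowered.com/IDOTA2Match_570/GetMatchHistory/v1/"
            + "?key=" + key + "&league_id=" + leagueId + "&matches_requested=100");
        HttpResponse<String> resp = HttpClient.newHttpClient()
            .send(HttpRequest.newBuilder(uri).GET().build(),
                  HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body()); // top-level JSON: match IDs, start times, players
    }
}
```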
Extracting raw data from replay files is made relatively easy by parser libraries like clarity; however, assembling the data in a meaningful way can be quite tedious (the datdota Source 2 parser codebase is around 5k LOC of Java). There are myriad edge cases, oddities, and unexpected rabbit holes. The output from this parser is, in our case, JSON: either a simple file (when invoked on the command line) or exposed via an HTTP endpoint when the parser runs as a webservice. The size of this JSON blob varies with the length of the game and the number of things happening, but is generally 150MB to 1200MB.
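For a sense of what working with clarity looks like, here's a minimal sketch (modelled on clarity's published examples; exact package names vary between clarity versions) that counts hero deaths from the combat log. Even at this size the edge cases creep in: without the illusion check, illusion deaths would be counted too.

```java
import skadistats.clarity.model.CombatLogEntry;
import skadistats.clarity.processor.gameevents.OnCombatLogEntry;
import skadistats.clarity.processor.runner.SimpleRunner;
import skadistats.clarity.source.MappedFileSource;
import skadistats.clarity.wire.common.proto.DotaUserMessages;

public class HeroDeathCounter {
    private int deaths;

    @OnCombatLogEntry
    public void onEntry(CombatLogEntry e) {
        // count deaths of real heroes only; illusions also appear flagged as heroes
        if (e.getType() == DotaUserMessages.DOTA_COMBATLOG_TYPES.DOTA_COMBATLOG_DEATH
                && e.isTargetHero() && !e.isTargetIllusion()) {
            deaths++;
        }
    }

    public static void main(String[] args) throws Exception {
        HeroDeathCounter counter = new HeroDeathCounter();
        new SimpleRunner(new MappedFileSource(args[0])).runWith(counter);
        System.out.println("hero deaths: " + counter.deaths);
    }
}
```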
Below is a diagram of the parsing setup. Decoupling the parser from the backend has been helpful for a few reasons: the parsers can be horizontally scaled over multiple machines, the parsers can run a different Java version to the backend, and a caching layer can sit between the parsers and the backend (for example varnishd). This approach has worked well for other uses: Windrun has a similar setup and has since parsed ~18 million Ability Draft matches, although it extracts only a relatively tiny amount of data from each match (~10KB).
Parser Design Goals
It’s always been a goal to make the parsers as backwards-compatible as possible - the same code running on all matches with as little branching as possible.
Ideally, the parser should be able to determine for itself how to handle the situations where branching is unavoidable. If an object or field doesn't exist (say the m_iTormentorKills property is missing from a player entity), the replay is probably from a time before that data was available. Rather than hard-coding logic about when certain features do or don't exist, the parser just needs to emit sane defaults; QA can then be done on the parser output.
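A minimal sketch of that "sane defaults" idea, assuming clarity's Entity property accessors (hasProperty/getProperty); the property name is the one mentioned above:

```java
import skadistats.clarity.model.Entity;

public final class PlayerStats {
    /** Tormentor kills for a player, or 0 on replays predating the field. */
    static int tormentorKills(Entity player) {
        if (player.hasProperty("m_iTormentorKills")) {
            return player.getProperty("m_iTormentorKills");
        }
        return 0; // sane default: the stat simply didn't exist yet
    }
}
```

With defaults like this, downstream QA only has to spot values that look implausible for a given era, rather than every processor handling missing fields itself.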
Unfortunately, in 2022 there were some changes which broke this paradigm: player IDs changed from (0, 1, … 9) to (0, 2, 4, … 18). There were two options: parse the replay twice, detecting on the first pass which scheme we were dealing with; or look up the match ID and apply custom logic based on that. The second option was much cleaner, but it meant some amount of global state now had to be tracked per-parse.
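In sketch form, that per-parse state might look like the following; the cutoff match ID here is a placeholder, not the real changeover point:

```java
final class ParseContext {
    // hypothetical cutoff: matches at or after this ID use the spaced ID scheme
    private static final long SPACED_IDS_FROM_MATCH = 6_500_000_000L; // placeholder

    private final boolean spacedPlayerIds;

    ParseContext(long matchId) {
        this.spacedPlayerIds = matchId >= SPACED_IDS_FROM_MATCH;
    }

    /** Maps a raw replay player ID onto a stable 0..9 slot. */
    int slotOf(int rawPlayerId) {
        return spacedPlayerIds ? rawPlayerId / 2 : rawPlayerId;
    }
}
```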
By The Numbers
Just under 49k replays were re-parsed, representing 43.2 months of Dota 2 playtime.
Parsing (including ingesting into PostgreSQL tables) took 37 days and generated 8.56TB of JSON files. While DreamLeague S22 was on (14 days), only 1 replay was parsed at a time; otherwise 2 replays were parsed concurrently.
So far 2024 has a 99.71% parse success rate, the best year ever. 2023 is the next best year, with 99.43%. The overall Source 2 success rate is 98.69%, and 95.65% of the Source 2 failures happened because the replay file was not uploaded correctly to the Valve CDN.
What’s Next
Datdota has been a passion project for me for a long time, so it’s often just a question of finding something interesting to work on that’s suitable for the time gap I’ve got (mostly holidays!) and the energy levels I’ve got (hell yeah holidays!).
Recently I've finished off a few backend features (the biggest two being the query language rewrites and the new events) and have a few more I'd like to wrap up (faction selection, an 'against team' filter, tower falling order, megacreep comebacks).
In addition to this, I've started to chip away at a parser for Source 1. So far I've finished 6 of the 13 processors (using the same structure as the Source 2 parser, as well as the same serialization models). That count is a bit deceptive though: the Events processor emits 12x the number of objects (though each is quite simple), while on the other hand the Frames processor emits a single object type with 116 properties. I'd estimate I'm around 30% of the way through, but I might deploy it while it's only partially done. One thing is for sure: the Teamfight processor will not be ported to Source 1; it was a multi-month project to get working correctly and I'm not going through that again.
Overall, a bunch of fun was had by all, a few bugs were fixed, and a few more features were added. The PostgreSQL DB grows ever bigger!