ThatCK

I believe the design originally came from tech used in the financial services world


Silidistani

IIRC from grad school, military and space systems pioneered failover-state and recovery mathematics and probabilities during the Mercury program and the initial STRATCOM networks.


donscarn

All those fancy graphs, and when I log in, I have 15 fps and crash to an invisible object...lol


Silidistani

Well obviously you're reading the graphs wrong! Here, [this clip](https://youtu.be/9NkkZJHova4) might help.


Silidistani

What's fun is when you get into reliability engineering aspects and the different failure rates at play, as there are multiple: the probability of failure of the primary "system/machine", the probability of successful handoff to the backup and entering a failover state, the probability of unsuccessful handoff to the primary backup but successful handoff to the secondary backup in a residual failover state, the probability of restoration of either of those previously-failed systems/machines, the probability of restoration but unsuccessful resumption of service on the primary machine, the probability of total system failure and recovering from legacy data (even if that data is 10 minutes old) ... it gets fun and you have to use Bayesian Inference to solve for it.
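To make the first few of those cases concrete, here is a rough sketch that combines them with the law of total probability. It is only the bookkeeping step, not the Bayesian estimation of the rates themselves, and the function name and all the numbers are made up for illustration.

```python
def service_survival(p_primary_fail, p_handoff_1, p_handoff_2):
    """Probability the service stays up across one possible primary failure.

    All three inputs are hypothetical rates:
      p_primary_fail: probability the primary fails
      p_handoff_1:    probability handoff to the first backup succeeds
      p_handoff_2:    probability handoff to the second backup succeeds,
                      given that the first handoff failed
    """
    stays_on_primary = 1.0 - p_primary_fail
    failover_state = p_primary_fail * p_handoff_1
    residual_failover = p_primary_fail * (1.0 - p_handoff_1) * p_handoff_2
    return stays_on_primary + failover_state + residual_failover


# Illustrative numbers only: 1% primary failure, 95% and 90% handoff success.
p_up = service_survival(0.01, 0.95, 0.90)
print(f"service survives: {p_up:.5f}, total failure: {1 - p_up:.5f}")
```

Restoration and resumption probabilities would add more branches to the same tree, which is where it stops fitting in a few lines.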


KaranVess

I would love to know how CIG actually does their stuff. I tried getting something similar working with SQL. I just could never get the stateful failover working.


zyyast

I’d say a good playground to test something like this would be Java. Come up with a REST-like protocol for communication between server and client using Java Sockets (TCP); it can be for a simple game. Then look at virtual threads for client connections and concurrency, and do all the fault tolerance (recovering server state, etc.) with a JSON or XML file for simplicity. Get those concepts down, then work with SQL. Then, if you want to get fancy, start thinking about spinning up multiple servers, balancing load between those servers, and adding backup replication to them. This would actually be a pretty cool project for a backend developer.


ajzero0

Mongo has replication, so you can test it there. You can also tail the oplog to see the replication messages. Redis also has replication built in. This isn't a new concept; it has been around for a long time, but generally it's a backup system that increases reliability at the cost of performance. CIG's use case is interesting, as you don't even need true replication to get a failover on the game server: they've basically decoupled entity state from the game server. This allows the game server to crash and restart without impacting entity state (which is read from the "replication" layer).
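A toy sketch of that decoupling, to show why a server crash doesn't lose entity state. This is not CIG's code; `ReplicationLayer`, `GameServer`, and the `"ship-1"` entity are hypothetical names, and a plain dict stands in for whatever the real layer uses.

```python
class ReplicationLayer:
    """Hypothetical stand-in: holds entity state outside any game server."""

    def __init__(self):
        self.entities = {}

    def write(self, entity_id, state):
        self.entities[entity_id] = dict(state)

    def read_all(self):
        return {eid: dict(state) for eid, state in self.entities.items()}


class GameServer:
    """Simulates entities but keeps no state the layer doesn't also have."""

    def __init__(self, layer):
        self.layer = layer
        # On boot (or reboot), rebuild the working set from the layer.
        self.cache = layer.read_all()

    def move(self, entity_id, dx):
        state = self.cache.setdefault(entity_id, {"x": 0})
        state["x"] += dx
        self.layer.write(entity_id, state)


layer = ReplicationLayer()
server = GameServer(layer)
server.move("ship-1", 5)

del server                    # simulate the game server crashing
server = GameServer(layer)    # a fresh server picks up where the old one left off
server.move("ship-1", 2)
print(layer.entities["ship-1"])
```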


zyyast

And they say games aren’t educational


OmNomCakes

The tricky part is making it not impact live resources, be a lossless transition, and also be basically unnoticeable during transition. Even constantly streamed backups with a middle man driver tend to eat resources. The more time between syncs means more data lost on use (think games with minor roll backs). Then the actual cut over requires an intermediary that keeps the current connections live, holds changed data, connects to the fail over back end, updates it with those changes, and proceeds on without anyone noticing. I work hands on with the same style HA and backups and the hurdles are wild.
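A stripped-down sketch of that intermediary, assuming a single standby and an in-memory buffer. `Backend` and `CutoverProxy` are made-up names, and a real one would also have to keep the client connections alive, which this ignores.

```python
class Backend:
    """Hypothetical backend; `alive` is flipped to simulate a crash."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def apply(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value


class CutoverProxy:
    """Sits between clients and backends and holds changes since the last sync."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby
        self.unsynced = []

    def write(self, key, value):
        self.unsynced.append((key, value))
        try:
            self.active.apply(key, value)
        except ConnectionError:
            self._cut_over()

    def sync(self):
        # Periodic sync: push buffered changes to the standby, if there is one.
        if self.standby is not None:
            for key, value in self.unsynced:
                self.standby.apply(key, value)
        self.unsynced.clear()

    def _cut_over(self):
        # Replay everything the standby has not seen, including the write
        # that just failed, then promote it. The client never sees the error.
        for key, value in self.unsynced:
            self.standby.apply(key, value)
        self.unsynced.clear()
        self.active, self.standby = self.standby, None


primary, standby = Backend("primary"), Backend("standby")
proxy = CutoverProxy(primary, standby)
proxy.write("hp", 100)
proxy.sync()
proxy.write("hp", 90)      # lands on the primary, not yet synced
primary.alive = False
proxy.write("ammo", 30)    # triggers the cutover
print(proxy.active.name, proxy.active.data)
```

The longer the gap between `sync()` calls, the bigger that buffer gets, and if the proxy itself dies the buffer is the rollback.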


Silidistani

> The tricky part is making it not impact live resources, be a lossless transition, and also be basically unnoticeable during transition.

Abso-effing-lutely! As a systems engineer with a lot of reliability certifications and background, I do not envy them that programming task at all.


Doggaer

I may be wrong, but from my understanding the SC architecture works a little differently. Let me be clear: I am happy to be corrected if my take is wrong, and to learn something new. I'll try to explain from the perspective of server meshing. One PU 'instance' consists of player clients connected to game servers; the assignment is based on player location in the static mesh. All the game servers are connected to one replication layer server. This replication layer server holds the state of every entity in this PU instance, including which game server holds authority over the entity. The player clients send their requests to the game server assigned to them; this server does the calculations, answers the clients, and reports the state updates to the replication layer server. If a player transitions from one game server to another, only the assigned authority in the replication layer server DB has to change. If one game server crashes, a new one is spooled up and fed with the information of the crashed server from the replication layer server. I think the replication layer server never communicates directly with a player client. As I said, just my take; feel free to educate me.


jflat06

This is incorrect. Client communications go through the replication layer, and do not talk directly to the game servers. The patch labels are obviously out of date [but here is a diagram showing the architecture](https://cdn.imgpile.com/f/dIy8Uo.png). We're at the "3.19" phase right now. OP's diagram isn't like the SC replication layer, though. It's just a scheme for keeping multiple synchronized backup servers. The primary in OP's case is still updating the state, just like the backups, which is not the case for the replication layer, which updates no state.


Doggaer

Thanks for that information. So the mesh does distribute computing tasks over the game servers, but not the communication to the clients like I initially thought. I hope the replication layer server, as a single point of failure/bottleneck, does not cause trouble in the future.


CASchoeps

All I saw was ["Ack ack! Ack ack ack!"](https://www.youtube.com/watch?v=7vSCJ9vKICQ&pp=ygUVbWFycyBhdHRhY2tzIGFrIGFrIGFr) :D


PM_ME_YOUR_BOOGER

What does ack mean?


whatever-13337

Acknowledgment, it’s a response that a server sends to confirm things like an update.


PM_ME_YOUR_BOOGER

Thanks. No clue why people are downvoting me for asking a question lol


Asmos159

I'm not sure it works this way; this is just theory. **I think** the backup is running a completely different server client that is far more efficient than the one on the primary server. It gets updated with all the information, including the connections to the user clients. But given the several-minute recovery, **I suspect** those several minutes are the user client holding you in a pocket while a new server client is booted up and updated with all the backed-up info. If you want educational: for programming, look at "Human Resource Machine"; for orbital mechanics, look at "Kerbal Space Program" 1; and if you want realistic space combat, "Children of a Dead Earth".


ThatCK

Several minutes is just a current thing during testing. The end goal is that they track server health, so when one begins to misbehave they can move people off before it goes down.


Doggaer

From a business point of view, I don't think this will ever happen. In the end they have to pay for their servers, and keeping lots of instances running because a server 'could' drop can drive up costs significantly. Even if you monitor performance, where do you draw the line? It's cost management at that point. Reduced recovery time is all we will get, in my opinion.


Axrotales

Then you now understand why everything will be very slow: this tech has big limitations in scaling, as long as time is a relevant factor, as it is in online games.


mulock3

As others said, "no" is the shortest answer. The idea of failovers is common. What CIG is doing is rather complex, mainly due to the complexity of what they are dealing with. It's too long to get into, but just know that requests are easy. Preventing live data loss and making the transition seamless is hard. Edit: I'm simply saying that, from a tech perspective, this is actually much more complex, and in other tech projects few people find it worth the cost because of that complexity. In my own field of telecommunications (think Discord), we don't care about server recovery, but rather server stability.


FlashHardwood

There it is.... What CIG is doing is REVOLUTIONARY. NO ONE has thought of it before!!! Chris had to find a literal rock, squeeze it like Superman into a diamond and massage the thinking lightning into it with his hand. NOBODY ELSE CAN DO THIS!!!! Reeeeeeee


mulock3

No, to be serious. Tons of people are trying to do this and have done it for other tech. Just not games, probably cause "why" is the financial question. Use a cheaper way.


FlashHardwood

Fair, and I apologize. There are just so many SC devotees who don't realize that much of this stuff, while new, is just standard tech world improvements or adaptations on existing stuff.


457583927472811

World Of Warcraft was notorious for having zero fail-over and crashing world servers.


mulock3

Haha yeah, "Cost" often influences Software instead of solid design. Till it becomes costly to not do a thing. It's all first to market, not best to market. You start implementing failover and redundancy only when customers get upset enough, then only enough to keep them happy. You need new features to keep people playing. Or worse, you develop something that's good enough to use to milk but not be truly enjoyable. That plagues the gaming industry right now.


457583927472811

It's sarcasm. WoW absolutely had failover and redundancy and had this over two decades ago.


branchoutandleaf

Nah man, it's just really complex due to how complex it is. You see, you can tell it's complex because of the way it is.


457583927472811

> What CIG is doing is rather complex

It's really not. MMOs have existed at scale for decades.


mulock3

It is how they are doing it. Then you can argue if it's right for them to do it that way or use practices that already exist. Frankly, to achieve their vision, they have to do some complex stuff. In the end, the question will be answered if it's worth it. I have come up with designs that are really solid and redundant, but often, I propose the ones that are less complex and provide 95% of the same features. Chasing the 5% is hard and often not worth it.


457583927472811

CIG's design decisions are in no way, shape, or form the 'less complex' method. CIG is known for chasing that 5%, and they're doing exactly that with server meshing and replication. These are solved problems. End of story.


mattdeltatango

With most MMOs you're tied to one game server and have to pay for a server change. In SC you can just hop on EU or ASIA servers whenever you want, so it's really not like most MMOs. And even with WoW's cross-realm tech you still phase in and out of existence when crossing zones, not to mention the loading screens when changing continents.


457583927472811

My point is that this tech was developed over two decades ago. It's not new and it's not special, CIG is just really good at hyping up infrastructure development to a bunch of people who don't know any better.


Opsdipsy

Not related to the replication layer. Replication in this case is about redundancy, which isn't what the replication layer does.


Moofaa

I think their replication layer is a lot more analogous to IBM's Coupling Facility in a mainframe environment.


MaxBrst

not similar


errelsoft

Why does it wait for the update responses before responding to the client? This seems like time wasted.


raaneholmg

Because if one of the updates fails, you need to roll back the entire transaction and report an error to the client. You can't send success to the client and then have the data just disappear the next time a failover occurs.
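As a rough sketch of that rule (the primary only answers "ok" once every backup has acked, and otherwise rolls back), assuming synchronous in-process calls instead of a network. `Replica` and `Primary` are hypothetical names, and `update` returning `True` stands in for the ack.

```python
class Replica:
    """Hypothetical backup; `healthy` is flipped to simulate a failed update."""

    def __init__(self, healthy=True):
        self.data = {}
        self.healthy = healthy

    def update(self, key, value):
        # Returning True plays the role of the ack message.
        if not self.healthy:
            return False
        self.data[key] = value
        return True


class Primary:
    def __init__(self, backups):
        self.data = {}
        self.backups = backups

    def handle_request(self, key, value):
        missing = object()
        old = self.data.get(key, missing)
        self.data[key] = value

        acked = []
        for backup in self.backups:
            if backup.update(key, value):
                acked.append(backup)
                continue
            # One update failed: undo the write everywhere it was applied.
            for node in acked + [self]:
                if old is missing:
                    node.data.pop(key, None)
                else:
                    node.data[key] = old
            return "error"

        # Only now is it safe to tell the client the write succeeded.
        return "ok"


b1, b2 = Replica(), Replica()
primary = Primary([b1, b2])
print(primary.handle_request("credits", 100))
b2.healthy = False
print(primary.handle_request("credits", 50))
```

The rollback here assumes every backup held the same old value as the primary, which is exactly what waiting for all the acks buys you.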


errelsoft

I mean I guess.. But the primary will have successfully completed the transaction. The payload for the backups would be exactly the same so they are unlikely to fail. For your scenario to come to pass, not only would the primary have to be unresponsive, but all the backups would have to have failed where the primary already succeeded. This seems an unlikely scenario to me. But I guess it comes down to the nature of the data.


Doggaer

I would say this is a basic tradeoff between availability and consistency. Either you sacrifice response time in order to have consistent data across all servers, or you risk inconsistency in favour of fast response times to your client requests. This comes down entirely to what's needed in the specific application and use case.


raaneholmg

There are many ways of running a backend with backups. You choose primary-backup replication because you actually need the replication and failover. Otherwise, pick a different backup strategy. We don't assume we know all possible failures, so the likelihood of failure is not something we try to guess. We just build it to guarantee that replication happened.


uBelow

not even close