Zero_Cool Apr 2, 2014 @ 11:30pm
Possible solution for the problem with the Steam servers.
Hi, thanks for reading this and sorry if I'm suggesting something you already know or if what I'll say is out of place or is something stupid on my part.

What I want to ask is this: why doesn't Steam build a high-availability cluster with its servers, to prevent data loss and long periods of downtime (or at least to keep users from noticing those problems)? If I remember correctly, there are several useful types of cluster, including the following two:
- High availability infrastructure: If a hardware failure occurs on one of the machines in the cluster, the high-availability software automatically starts its services on any of the other machines in the cluster (failover). When the failed machine recovers, the services are migrated back to the original machine (failback). This automatic recovery ensures high availability of the services offered by the cluster and minimizes the failures that users perceive.

- High availability application: If a hardware or application failure occurs on any of the machines in the cluster, the high-availability software automatically restarts the failed services on any of the other machines in the cluster, and when the failed machine recovers, the services are migrated back to the original machine. This automatic recovery preserves the integrity of the information, so there is no data loss, and it also spares users the inconvenience, since they never have to notice that there was a problem.
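
As a rough illustration of the failover and failback behaviour described in the two items above, here is a minimal Python sketch. It is only a toy in-memory model, not how Steam or any real cluster manager works. The class name `ToyCluster`, its methods, and the node and service names are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical toy model of failover/failback (not a real cluster manager)."""
    healthy: dict   # node name -> True if the node is up
    home: dict      # service name -> node it normally runs on
    placement: dict = field(init=False)  # service name -> node it runs on now

    def __post_init__(self):
        # At start-up every service runs on its home node.
        self.placement = dict(self.home)

    def fail(self, node):
        """Failover: restart the failed node's services on a surviving node."""
        self.healthy[node] = False
        survivors = [n for n, up in self.healthy.items() if up]
        if not survivors:
            raise RuntimeError("no healthy node left to fail over to")
        for service, current in self.placement.items():
            if current == node:
                self.placement[service] = survivors[0]

    def recover(self, node):
        """Failback: move services back to their home node once it is up again."""
        self.healthy[node] = True
        for service, preferred in self.home.items():
            if preferred == node:
                self.placement[service] = node


cluster = ToyCluster(
    healthy={"node-a": True, "node-b": True},
    home={"login": "node-a", "store": "node-b"},
)
cluster.fail("node-a")
print(cluster.placement)   # "login" has moved to node-b
cluster.recover("node-a")
print(cluster.placement)   # "login" is back on node-a
```

A real cluster also has to detect the failure, keep the data in sync between machines, and agree on who is in charge, which this sketch leaves out entirely.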

Valve would just have to decide which one to use and how to do it. I personally recommend doing it with MySQL Cluster, although there are other alternatives. I hope this is helpful or good for something. Greetings.

(Sorry if my english is hard to read, hehe, I tried to give the best of me, I promise my spanish is better than my english xD )
CharlestONE Apr 2, 2014 @ 11:39pm 
Your English is awesome. I can't comment on the substance of your idea though, as I know very little about server/hosting technology. Nevertheless, it was an interesting and well-written piece.
Scutterman Apr 3, 2014 @ 3:57am 
Your English is, indeed, awesome. I couldn't tell that you weren't a native speaker.

What you're referring to can also be called "Redundancy". I don't have any knowledge of Valve's infrastructure, but I imagine they already have these measures in place.

While Valve has been pretty quiet about the downtime, one opinion is that it's caused by a Distributed Denial of Service (DDoS) attack. These are harder to manage, and they need sophisticated analysis of requests to the Valve services. If it is DDoS attacks causing the issues, hopefully Valve is working on a more permanent solution.
Zero_Cool Apr 3, 2014 @ 9:46am 
I already know that this can cause redundancy but that way valve servers can recover from any situation, because if there is a problem on one of the servers the cluster will move all its services to another one. A cluster can also use this method to speed up some tasks: if one server is too busy, it will send some of its services to another server. So no matter whether it is a DDoS attack, a simple hardware problem, or servers that are too busy, they will keep working. Wouldn't it be beneficial to reduce the servers' downtime or the loss of services?
But Valve has to decide what to do. I'm just an amateur gamer who wants to play 24 hours a day, 7 days a week, and if the servers are down we cannot play, right?
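
The idea of sending work away from a busy server can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function `pick_server`, the server names, and the load and capacity numbers are all invented, and real load balancers are far more involved. The sketch also shows the limit of the approach: when every server is already full, there is nowhere left to send the work.

```python
def pick_server(load, capacity):
    """Return the least-loaded server that still has spare capacity, or None."""
    available = [name for name, used in load.items() if used < capacity[name]]
    if not available:
        return None  # every server is saturated; moving work around cannot help
    return min(available, key=lambda name: load[name])


# Hypothetical numbers: each server can handle 100 units of work.
capacity = {"srv1": 100, "srv2": 100, "srv3": 100}

print(pick_server({"srv1": 95, "srv2": 40, "srv3": 70}, capacity))     # srv2
print(pick_server({"srv1": 100, "srv2": 100, "srv3": 100}, capacity))  # None
```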
Satoru Apr 3, 2014 @ 10:03am 
Making a 'highly redundant worldwide' infrastructure is really, really, really complicated. It's not just about clustering and such. You have issues with scalability and the ability to ramp up when necessary. There's a reason why places like Google/Amazon/Facebook build out their own networks, network protocols, disk storage arrays (not SAN), database protocols (not SQL/Oracle), custom hardware, etc.

People think you can just 'throw more servers' at the problem. But when you get big enough, your problems end up being more systemic, and not 'the server is too slow' or 'you need more RAM' or 'you need a cluster'. Because you already have all those things.
Last edited by Satoru; Apr 3, 2014 @ 10:13am
Black Blade Apr 3, 2014 @ 10:07am 
(Great English.. I wish I had 50% of that XD)

Well, what you're saying will not really help against a DDoS attack that is done right, as all the servers will be busy then. That is, after all, the point of a DDoS.
And really, I'm not sure, but I think this just proves that they do have these in place. If they did not have what you suggest, only the people on the servers that got hit by the DDoS attack would have been affected, and the others would have had no problems, as their servers were not being hit.
Seeing that all of Steam had the problem during the DDoS we do know about, I think we can be sure the servers are passing the attack on to each other, until it reaches close to all of them (this is a guess, I did not really look into how it works).

Anyway, the main problem is that we have no clue whether these things are attacks, or Valve fixing something, or maybe even Valve moving their servers from a 3rd party host to their own servers...
We do not know much about what is really going on and causing these, therefore I think it's hard to say how we could really fix it, or whether we could at all.

But seeing that Valve only takes smart guys to work with them, I'd guess they know about these types of things and therefore already have it built in.
aiusepsi Apr 3, 2014 @ 11:46am 
Having read your post, I'm sure the Valve guys will realise that their intention to build a low-availability system was wrong, and they will now proceed to make it high-availability instead.

Seriously though, making something high-availability is a lot harder than it sounds. You have to deal with problems like split-brain syndrome and other breakdowns in consistency, which are the price you pay for maintaining both partition tolerance and availability according to the CAP theorem. It's not just a thing where you can flip a switch and it just works.
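
To make the split-brain problem a bit more concrete: one common guard is a majority quorum rule, where a group of servers that has lost contact with the rest may only keep accepting writes if it can still see more than half of the cluster. A minimal Python sketch follows. The function name `has_quorum` and the node names are hypothetical, and this is only the counting rule, not a full consensus protocol.

```python
def has_quorum(reachable, cluster_size):
    """True if this side of a network partition sees a strict majority of nodes."""
    return len(reachable) > cluster_size // 2


# A 5-node cluster splits 3/2: only the larger side may keep accepting writes.
print(has_quorum({"n1", "n2", "n3"}, 5))  # True
print(has_quorum({"n4", "n5"}, 5))        # False

# A 4-node cluster splits 2/2: neither side has a majority, so both stop.
print(has_quorum({"n1", "n2"}, 4))        # False
```

The side without quorum stops serving writes, which gives up availability there in order to keep the data consistent. That is the kind of trade-off the CAP theorem describes.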
Scutterman Apr 3, 2014 @ 12:57pm 
Originally posted by Zero_Cool:
I already know that this can cause redundancy but that way valve servers can recover from any situation
"Redundancy" is a good thing when talking about technology, and Valve already has Redundant servers. I'm sure they're doing their best, but the Steam network is one of the largest consumers of bandwidth in America. Working with challenges like that will take time.
Clansm3n Apr 3, 2014 @ 2:12pm 
Yeah, I don't know much about technology, but if it will keep the Steam trading servers from going down as often as they do, I'm all for it!
