Getting gateway timeouts on many API calls

The Movie Database support

Posted by hosam
February 8, 2014 at 8:54 PM

Here is one of the example calls I am making:

https://api.themoviedb.org//3/discover/movie?api_key=&with_genres=99&language=en&page=2&vote_count.gte=5&sort_by=release_date.desc

The same thing happens for list calls.

It doesn't happen on every call, but the call fails almost 70% of the time.

5 replies (page 1 of 1)

Reply by Travis Bell

February 8, 2014 at 11:08 PM

Hey hosam,

We had one of our API servers fail and the remaining servers got overwhelmed with the resulting load.

We're looking into ways to make sure our system can handle a single point of failure. Everything is already duplicated for redundancy in cases like this (2 database servers, 2 search servers, 2 cache servers, 2 web servers), but our 4 API servers were apparently unable to handle the load with one down. This will be fixed.

Cheers.

Reply by hosam

February 9, 2014 at 6:36 AM

No worries! Thanks a lot for your speedy reply, as always :)

Reply by Travis Bell

February 9, 2014 at 10:21 AM

If you're curious about what happened and how we fixed it, here's a more detailed rundown.

We use Amazon EC2 for just about everything and they have the concept of availability zones. These are discrete areas within the EC2 service where you can increase your availability by protecting yourself from different issues that can arise from within a zone itself. Anything from hardware failures to network failures--basically things that are out of our hands and problems that Amazon can have.

In our case we had 2 servers in one zone, and 2 others in another. Awesome! ... until one of those instances failed.

The way our load balancer splits traffic is pretty much 50/50 to each zone. I had originally thought that if one server failed, the remaining three would take over the load. Unfortunately that's not quite right, and our ops guy was quick to point that out. It's also why we pay him to do this and not me :D Only the server in the same availability zone takes over the dead server's traffic. So in that one availability zone we doubled the load on the one remaining box, and that much load was enough to saturate it, so it started dropping requests too. Now we weren't just down one server, we were down almost two. The problem basically compounded itself, resulting in a dead availability zone.
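To make the arithmetic above concrete, here is a minimal Python sketch of a balancer that splits traffic evenly across zones and then evenly across the healthy servers inside each zone. The function name `per_server_share` and the zone names are made up for illustration, and the model is an assumption based on the description above, not the actual load balancer configuration. It also ignores the case where a zone has no healthy servers left.

```python
from fractions import Fraction


def per_server_share(zones):
    """Return each healthy server's share of total traffic.

    `zones` maps a zone name to a list of booleans (True = healthy).
    Assumption: traffic is split evenly per zone first, then evenly
    among the healthy servers inside that zone.
    """
    shares = {}
    zone_share = Fraction(1, len(zones))  # e.g. 50/50 for two zones
    for name, servers in zones.items():
        healthy = [i for i, up in enumerate(servers) if up]
        for i in healthy:
            shares[(name, i)] = zone_share / len(healthy)
    return shares


# Original layout: 2 zones with 2 API servers each, all healthy.
before = per_server_share({"zone-a": [True, True], "zone-b": [True, True]})

# Same layout after one server in zone-a fails.
after = per_server_share({"zone-a": [True, False], "zone-b": [True, True]})

for server, share in sorted(after.items()):
    print(server, share)
```

Under these assumptions the surviving server in the affected zone goes from a quarter of all traffic to half of it, while the two servers in the other zone stay at a quarter each.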

To fix this we're now running 8 servers, 4 in each zone. This makes each server's job less important in the whole stack, which means that if we go down one, we have three times the capacity per zone to take over the resulting load (remember, before it was one remaining server, now it will be three).

We have invested a lot to make sure we are able to deliver our service to everyone all the time. Sometimes, though, it takes the odd failure to point out flaws in the design. The changes we made last night should help prevent this situation from happening again.
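A rough way to compare the old and new layouts is to ask how much extra load each surviving server in a zone picks up when one server fails. The sketch below assumes the zone's traffic stays constant and is spread evenly over the survivors; `survivor_load_multiplier` is a hypothetical helper, not something from our stack.

```python
from fractions import Fraction


def survivor_load_multiplier(servers_per_zone, failed=1):
    """Factor by which each surviving server's load grows in a zone.

    Assumption: the zone keeps receiving the same traffic and spreads
    it evenly over whichever servers are still healthy.
    """
    survivors = servers_per_zone - failed
    if survivors <= 0:
        raise ValueError("no healthy servers left in the zone")
    return Fraction(servers_per_zone, survivors)


old_layout = survivor_load_multiplier(2)  # 2 servers per zone
new_layout = survivor_load_multiplier(4)  # 4 servers per zone

print("old layout:", float(old_layout))
print("new layout:", float(new_layout))
```

With 2 servers per zone the survivor's load doubles. With 4 per zone each of the three survivors takes on about a third more, which is far less likely to saturate a box.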

Thanks!

Reply by hosam

February 9, 2014 at 10:32 AM

Thanks for the extensive explanation :) You guys have some of the best support on the planet. That's why I am a die-hard fan of the website and service.

Keep up the great work!

Reply by Travis Bell

February 9, 2014 at 10:40 AM

You're very welcome!
