A very good article on performance analysis at Netflix ...
“Most current performance tool methodologies are so 1990’s.”
TRUE THAT
The strategy is interesting for scaling websockets.
Putting ELB in front, I think, avoids having to use keepalived or the like with a VIP. So, HAProxy > * as usual :)
An example of a well-thought-out architecture.
you don't get to choose when scaling challenges come up ... True story bro, I hope I never have to deal with this kind of crap, where the choice isn't made up front and everything collapses.
We run bitly split across two data centers, one is a managed environment with DELL hardware, and the second is Amazon EC2.
Fork Rate. A strange configuration issue caused processes to be created at a rate of several hundred a second rather than the expected 1-10/second.
Flow control packets. A network configuration that honors flow control packets and isn’t configured to disable them can temporarily cause dropped traffic.
Swap In/Out Rate. Measure the right thing. It's the rate memory is swapped in/out that can impact performance, not the quantity.
Server Boot Notification. Use an init script to capture when servers are dying. Servers do die, but are they dying too often?
NTP Clock Offset. If you are not checking, one of your servers is probably not properly time synced.
DNS Resolutions. This is a key part of your infrastructure that often goes unchecked. It can be the source of a lot of latency and availability problems. On the Internal DNS check quantity, latency, and availability. Also verify External DNS servers give the correct answers and are available.
SSL Expiration. Don't let those certificates expire. Set up an expiration check.
DELL OpenManage Server Administrator (OMSA). Monitor the outputs from OMSA to know when failures have occurred.
Connection Limits. Do you know how close you are to your connection limits?
Load Balancer Status. It's important to have visibility into your load balancer status by making the health stats visible.
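Note to self on the "Fork Rate" and "Swap In/Out Rate" items above: both are rates, so they come from the difference between two samples of a cumulative counter, not from a single reading. A minimal sketch, assuming a Linux host where `/proc/stat` exposes a `processes` counter (forks since boot) and `/proc/vmstat` exposes `pswpin` / `pswpout`. The function names are my own and not from the article.

```python
def parse_counters(text):
    """Parse 'name value' lines (the /proc/vmstat layout) into a dict of ints.

    Lines with more than two fields (e.g. the 'cpu ...' rows of /proc/stat)
    are skipped.
    """
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def rate_per_second(before, after, elapsed_seconds):
    """Turn two samples of cumulative counters into per-second rates."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return {
        name: (after[name] - before[name]) / elapsed_seconds
        for name in after
        if name in before
    }


def sample(paths=("/proc/vmstat", "/proc/stat")):
    """Read the counter files and merge them (assumes Linux paths)."""
    merged = {}
    for path in paths:
        with open(path) as handle:
            merged.update(parse_counters(handle.read()))
    return merged
```

To use it, call `sample()` twice a few seconds apart, pass both results and the elapsed time to `rate_per_second`, and alert on the `processes`, `pswpin` and `pswpout` entries. The alert thresholds are up to you; the excerpt only says 1-10 forks/second was the expected range in their case.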
A very good publication, really
Local copies do not protect against site outages.
If you have a flood in your server room RAID doesn’t help.
Google File System (GFS), used throughout Google until about a year ago, takes the concept of RAID up a notch. It uses coding techniques to write to multiple datacenters in different cities at once, so you only need N-1 fragments to reconstruct the data. So with three datacenters, one can die and you still have the data available.
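To convince myself of the "N-1 fragments" idea, here is a toy sketch with N = 3 using plain XOR parity (the RAID-style trick). This is only an illustration of the principle and makes no claim about the coding scheme Google actually uses; all names are hypothetical.

```python
def xor_bytes(left, right):
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(left, right))


def encode(data):
    """Split data into two halves plus one XOR parity fragment (3 fragments)."""
    if len(data) % 2:
        data += b"\x00"  # pad to an even length; the caller keeps the real length
    half = len(data) // 2
    first, second = data[:half], data[half:]
    return [first, second, xor_bytes(first, second)]


def reconstruct(fragments, original_length):
    """Rebuild the data from three fragment slots, at most one of them None."""
    first, second, parity = fragments
    if [first, second, parity].count(None) > 1:
        raise ValueError("need at least two of the three fragments")
    if first is None:
        first = xor_bytes(second, parity)
    elif second is None:
        second = xor_bytes(first, parity)
    # If only the parity fragment is missing, both data halves are intact.
    return (first + second)[:original_length]
```

Each fragment would live in a different datacenter; losing any single one still leaves enough to rebuild the payload, which a local RAID array cannot offer when the whole site is flooded.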
Excellent article by the Graphite developer; he explains a lot about the design, scalability, and the choice of Python.
This thing looks good
To try out