How to Test the Ability of Large-Scale, Distributed Software Systems to Cope with Failures

We conduct functional testing, check performance, and write unit tests. However, these activities may not be enough for large-scale, heavily loaded distributed systems.

  • What will happen to your distributed system when a network failure splits it into isolated segments?
  • Will your system respond correctly to the failure of cluster nodes?
  • Are you sure that your database does not lose data?
  • Have you ever thought about the reliability and security of your system?

In this presentation, I will share how, drawing on the experience of Amazon, Netflix and Twitter, I created a framework to test a system's ability to cope with failures. You will learn which technologies and approaches are useful for testing distributed in-memory systems.

Room: Albert 2-3
Speakers

Pavel Lipsky
Principal Software Engineer at Dell Technologies
Pavel is a principal software engineer at Dell Technologies. He is an expert in performance optimization and in testing the ability of distributed software systems to cope with failures. Pavel has spent the last decade building high-load software systems for companies around the world. Before joining Dell Technologies, he worked for Rambler, Opera Software and Sberbank. Pavel actively contributes to open source communities.
