In the past two articles I discussed both the basics of distributed systems and the fallacies of distributed systems. In the final installment of this series, I’ll cover the design considerations and best practices for creating a distributed system.
There are many types of distributed systems, each designed to meet specific business requirements. Based on the priority of requirements, the following considerations may change, but in general, most developers will need to keep these in mind when designing a distributed system:
- Availability – Operational characteristic of a system where it is always ready to handle requests.
- Scalability – The ability of a system to increase its capacity to handle more load. This could be achieved by adding more servers to a cluster.
- Performance – Speed at which a system is able to handle requests.
- Cost – Total cost of ownership of the system. This could include hardware, software, development, testing, hosting, and cloud infrastructure.
- Manageability – Maintenance, update, migration, scaling, and diagnostic should all be manageable.
- Reliability – Able to adapt to the load and respond properly during exceptional conditions.
- Heterogeneity – Ability to support a variety of devices and protocols.
- Fault Tolerance and Failure Management – Systems designed with an expectation of failure will make them more fault tolerant.
- Concurrency – When multiple parts work at the same time (a given for distributed systems).
- Migration and Load Balancing – Closely related to reliability, fault tolerance, and failure management.
- Security – Ensuring proper authorization and authentication between the users and components of the system is key ensuring the confidentiality and integrity of the data.
- Composability and Modularity – Many small subsystems and modules forming a larger system in a way that can be configured and reused in different ways.
Example Distributed System: Traverz
For the purpose of walking through the best practices of designing a distributed system, I’ve created a fictitious application called Traverz, which allows drivers to share traffic information. This information, such as current location, crash reports, road work, and stopped vehicles, is made available to other drivers who are also part of this distributed system. This data is then processed and sent out appropriately to other users, informing them of upcoming road conditions.
When designing Traverz, our expectations and design goals are derived from the business requirements, such as the need to:
- Support the most common devices on the market.
- Support a large number of concurrent users.
- Keep response times very fast.
- Provide near real-time updates based on location and direction.
- Provide alternate routes when a faster option is available.
The Traverz system will consist of many moving parts that are spread out over disparate locations and regions. The diagram above (Fig. 1) is a logical model of the various layers and tiers. On the far left are the client applications that send out updates to the Traverz API end points. Each of the blue boxes represent components of Traverz that can be scaled up independently of each other to meet load and performance requirements.
The devices Traverz will support are Android and iPhones as these are the most common in the marketplace. Due to the rich interactive nature of the application, including features such as mapping, GPS and turn-by-turn guidance, we’ll have to use a native application for each platform. “Native” in this instance means an application that is coded and compiled to run directly on that specific device platform.
Communication and Connectivity
Another consideration is how often the devices send out updates to the servers. As a baseline we could say that each device broadcasts its location, direction, speed, and state every five seconds. This can be adjusted as we test out the system in the field.
Since the devices being used are mobile, connectivity is a major concern, which introduces issues such as limited coverage, switching between cell sites, and heavy infrastructure. These issues can result in data loss, latency, and lower data speeds. To keep the data exchange logic simple, the devices will need to establish a new connection to the server, send and receive updates, and close the connection.
If there is no connectivity, the application will display an appropriate message to the user. In the background, the application will monitor for connectivity and resume sending information.
From the server side, all data sent back to clients could also be HTTP/GZIP compressed to further minimize data transfer size.
Protecting the confidentiality and integrity of the data is key to ensuring the system is not compromised by the bad guys. Standard ways of securing web-based APIs should be used, such as SSL/TLS (HTTPS). This ensures the clients are connecting to a known server, and the communication between them is encrypted, preventing man-in-the-middle attacks.
We want Traverz to handle hundreds of thousands of users. To be able to achieve that kind of scale, as well as keep initial investments to a minimum, we could use one of the many cloud platforms. For this example, I’ll describe a possible implementation using Microsoft Azure.
API End Point
Azure app service is a relatively new offering from Microsoft that allows us to host websites, API end point, business logic, and integration with other applications such as Salesforce. A lot of the modules in Fig. 1 can be handled by this service, such as:
- API end point
- Traverz portal
- Data processing and business logic
- Routing logic
Once the basic APIs and website have been designed and implemented appropriately, scaling up to support N number of users is just a matter of monitoring and configuration.
We can choose from a few database options: SQL Azure or MongoDB. SQL Azure is very similar to SQL Server, but it is a cloud scale version. One of its useful features as it pertains to Traverz is its support of spatial queries. You’ll be able to find other people in the system who are within a certain geographic criteria.
The second database option is MongoDB, which is a NoSQL type of database. In such a system all data is represented as documents. Some of its features are high performance, high availability, and easy scalability.
Another database consideration to remember is the natural division of the data based on the location of the user. This allows us to shard and partition the data into logical segments and divide it across multiple databases.
Throughout this series, I’ve discussed the general concepts and concerns of distributed systems, the fallacies engineers assume during initial distributed system design, and finally the best practices of a distributed system using a fictitious app. There’s no question that increased mobility and the Internet of Things will continue driving the need for distributed systems. So it’s key to ensure that these systems are properly engineered for sound, secure, and flexible performance.
The Empower and Protect Blog brings you cybersecurity and information technology insights from top industry experts at Telos.