The resulting manuscript strikes a balance between our two objectives, namely to address new and emerging issues, and maintain the main characteristics of the book in addressing the principles of distributed data management. The third edition is coming out at a time when there is renewed interest in distributed data management. The last ten years have seen an accelerated investigation of distributed data management technologies spurred by the advent of high-speed networks, fast commodity hardware, very heavy parallelization of hardware, interest in cloud computing, and, of course, the increasing pervasiveness of the web. We hope the book contributes to the renewed discussion on these topics.
With every company becoming a software company, any process that can be moved to software will be. As computing systems grow in complexity, systems have become more distributed than ever, and modern applications no longer run in isolation. The vast majority of products and applications rely on distributed systems.
Also known as distributed computing or distributed databases, a distributed system is a collection of independent components, located on different machines, that exchange messages with each other in order to achieve common goals.
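The definition above can be sketched in a few lines: a minimal, illustrative simulation (not a real networked system) in which two independent "nodes" share no state and cooperate only by exchanging messages. The `Node` class and queue-based transport are assumptions made for the example.

```python
# Minimal sketch of message-passing components cooperating toward a goal.
# Queues stand in for the network; in a real system these would be sockets.
import queue
import threading

class Node:
    """An independent component that talks to others only via messages."""
    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()

    def send(self, other, message):
        other.inbox.put((self.name, message))

    def receive(self):
        return self.inbox.get(timeout=5)

a, b = Node("a"), Node("b")

def worker():
    sender, msg = b.receive()   # b waits for a request...
    b.send(a, msg.upper())      # ...and replies, completing the shared task

t = threading.Thread(target=worker)
t.start()
a.send(b, "ping")
reply = a.receive()
t.join()
print(reply)  # ('b', 'PING')
```

Note that neither node reads the other's memory directly; all coordination flows through messages, which is the defining property of a distributed system.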
Many industries use real-time systems that are distributed locally and globally. Airlines use flight control systems, Uber and Lyft use dispatch systems, manufacturing plants use automation control systems, and logistics and e-commerce companies use real-time tracking systems.
There used to be a distinction between parallel computing and distributed systems. Parallel computing was focused on how to run software on multiple threads or processors that accessed the same data and memory. Distributed systems meant separate machines with their own processors and memory. With the rise of modern operating systems, processors, and cloud services, distributed computing now also encompasses parallel processing.
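The distinction can be made concrete with a small sketch: threads in one process share memory directly (the parallel model), while a separate process has its own memory and can only communicate results as messages (the distributed model). Using a child Python process and stdout as the "message channel" is a simplification chosen for illustration.

```python
import subprocess
import sys
import threading

# Parallel computing: threads in one process mutate shared memory directly.
shared = {"count": 0}
lock = threading.Lock()

def bump():
    with lock:
        shared["count"] += 1   # every thread sees the same dict

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Distributed computing: a separate process with its own memory; the only
# way to get the result back is to pass it as a message (here, via stdout).
child = subprocess.run(
    [sys.executable, "-c", "print(7 * 7)"],
    capture_output=True, text=True,
)
answer = int(child.stdout)

print(shared["count"], answer)  # 4 49
```

The thread version could instead read `shared["count"]` from any thread at any time; the child process could not, which is why distributed systems must be designed around messages rather than shared state.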
Heterogeneous distributed databases allow for multiple data models and different database management systems. Gateways are used to translate the data between nodes, and such configurations usually arise as a result of merging applications and systems.
In the early days, distributed systems architecture consisted of a server as a shared resource like a printer, database, or a web server. It had multiple clients (for example, users behind computers) that decide when to use the shared resource, how to use and display it, change data, and send it back to the server. Code repositories like Git are a good example, where the intelligence is placed on the developers committing the changes to the code.
Every engineering decision has trade-offs. Complexity is the biggest disadvantage of distributed systems. There are more machines, more messages, and more data being passed between more parties, which makes such systems harder to design, operate, and debug.
Heterogeneous Database: In a heterogeneous distributed database, different sites can use different schemas and software, which can lead to problems in query processing and transactions. A particular site might even be completely unaware of the other sites. Different computers may use different operating systems and different database applications, and they may even use different data models for the database. Hence, translations are required for different sites to communicate.
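The translation step mentioned above can be sketched as a small gateway function that converts a relational tuple from one site into the document model used by another site. The field names and target structure are invented for illustration; a real gateway would be driven by schema mappings.

```python
# Hypothetical gateway translation: relational row -> document model.
def row_to_document(row, columns):
    """Translate a flat relational tuple into a nested document."""
    flat = dict(zip(columns, row))
    # The two sites structure (and type) the same data differently.
    return {
        "customer": {"id": flat["cust_id"], "name": flat["cust_name"]},
        "balance": float(flat["balance"]),   # stored as text at the source
    }

row = (42, "Ada Lovelace", "100.50")
columns = ["cust_id", "cust_name", "balance"]
doc = row_to_document(row, columns)
print(doc)
```

Every pair of sites with different models needs such a mapping (or a shared intermediate model), which is why heterogeneity adds significant cost to query processing.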
This document introduces concepts, principles, terminology, and architecture of near-zero downtime database migration for cloud architects who are migrating databases to Google Cloud from on-premises or other cloud environments.
database migration: A migration of data from source databases to target databases with the goal of turning down the source database systems after the migration completes. The entire dataset, or a subset, is migrated.
heterogeneous migration: A migration from source databases to target databases where the source and target databases are of different database management systems from different providers.
Categorizing migrations by data model more accurately expresses the complexity and effort required to migrate the data than basing the categorization on the database system involved. However, because the commonly used categorization in the industry is based on the database systems involved, the remaining sections are based on that distinction.
In some database migration scenarios, significant downtime is acceptable. Typically, this allowance is a result of business requirements. In such cases, you can simplify your approach. For example, with a homogeneous database migration, you might not require data modification; export/import or backup/restore are perfect approaches. With heterogeneous migrations, the database migration system does not have to deal with updates of source database systems during the migration.
An active-active migration supports clients writing into both the source as well as the target databases during the migration. In this type of migration, conflicts can occur. For instance, if the same data item in the source and target database is modified so as to conflict with each other semantically, you might need to run conflict resolution rules to resolve the conflict.
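One possible conflict resolution rule is last-write-wins, sketched below under the assumption that each version carries a timestamp. Real systems may instead use site priority, merge functions, or manual review; the rule and field names here are illustrative only.

```python
# Hedged sketch of a last-write-wins conflict resolution rule for an
# active-active migration: when both databases changed the same item,
# the version with the newer timestamp survives.
def resolve(source_version, target_version):
    """Pick the surviving value for a semantically conflicting data item."""
    if source_version["ts"] >= target_version["ts"]:
        return source_version
    return target_version

source_row = {"value": "alice@new.example", "ts": 170}
target_row = {"value": "alice@old.example", "ts": 150}
winner = resolve(source_row, target_row)
print(winner["value"])  # alice@new.example
```

Note that last-write-wins silently discards the losing update, which is acceptable for some data (e.g., a profile field) but not for others (e.g., account balances, which need a semantic merge).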
A special case not discussed here is the migration of data from a database into the same database. This special case uses the database migration system for data transformation only, not for migrating data between different systems across different environments.
The database migration system is at the core of database migration. The system executes the actual data extraction from the source databases, transports the data to the target databases, and optionally modifies the data during transit. This section discusses the basic database migration system functionality in general. Examples of database migration systems include Striim, tcVision, and Cloud Data Fusion.
The database migration system requires access to the source and to the target database systems. Adapters are the abstraction that encapsulates the access functionality. In the simplest form, an adapter can be a JDBC driver for inserting data into a target database that supports JDBC. In a more complex case, an adapter runs in the environment of the target (sometimes called an agent), accessing a built-in database interface like log files. In an even more complex case, an adapter or agent interfaces with yet another software system, which in turn accesses the database. For example, an agent accesses Oracle GoldenGate, and that in turn accesses an Oracle database.
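The adapter abstraction can be sketched as an interface the migration system writes through, with concrete implementations per target. The class names below are hypothetical; a real adapter might wrap a JDBC driver, a log reader, or an agent such as Oracle GoldenGate.

```python
# Illustrative sketch of the adapter abstraction: the migration core only
# sees the interface, not the target-specific access mechanism.
from abc import ABC, abstractmethod

class TargetAdapter(ABC):
    """Encapsulates how the migration system writes to a target database."""
    @abstractmethod
    def insert(self, table, rows):
        ...

class InMemoryAdapter(TargetAdapter):
    """Simplest possible adapter: writes into a local dict (for testing).
    A production adapter would instead issue INSERTs via a driver or agent."""
    def __init__(self):
        self.tables = {}

    def insert(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)

adapter = InMemoryAdapter()
adapter.insert("accounts", [{"id": 1}, {"id": 2}])
print(adapter.tables["accounts"])
```

Because the core only depends on `TargetAdapter`, swapping a JDBC-backed adapter for an agent-backed one requires no change to the extraction and transport logic.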
Database migration systems, or the environments on which they run, can fail during a migration, and in-transit data can be lost. When failures occur, you need to restart the database migration system and ensure that the data stored in the source database is consistently and completely migrated to the target databases.
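One common way to make a restart safe, assuming the source provides an ordered change stream with sequence numbers, is to persist a checkpoint after each applied change and resume from it. The sketch below simulates a crash-and-restart; the checkpoint dict stands in for durable storage.

```python
# Sketch of checkpoint-based recovery: after a restart, already-applied
# changes are skipped, so the target ends up complete with no duplicates.
def migrate(changes, target, checkpoint):
    """Apply changes after the last checkpointed sequence number."""
    for seq, row in changes:
        if seq <= checkpoint["last_seq"]:
            continue                   # already applied before the crash
        target.append(row)
        checkpoint["last_seq"] = seq   # persisted durably in a real system

target, checkpoint = [], {"last_seq": 0}
stream = [(1, "a"), (2, "b"), (3, "c")]

migrate(stream[:2], target, checkpoint)   # "crash" after two changes
migrate(stream, target, checkpoint)       # restart replays the full stream
print(target)  # ['a', 'b', 'c'] -- no loss, no duplicates
```

The essential property is that applying a change and advancing the checkpoint must be atomic (or the apply must be idempotent); otherwise a crash between the two steps can still duplicate data.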
These are only a few of the possible options to build a custom database migration. Although a custom solution provides the most flexibility and control over implementation, it also requires constant maintenance to address bugs, scalability limitations, and other issues that might arise during a database migration.
In a database migration, time-to-migrate is an important metric. In a zero downtime migration (in the sense of minimal downtime), the migration of the data occurs while the source databases continue to change. To migrate in a reasonable timeframe, the rate of data transfer must be significantly faster than the rate of updates of the source database systems, especially when the source database system is large. The higher the transfer rate, the faster the database migration can be completed.
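The reasoning above reduces to simple arithmetic: the target catches up at the transfer rate minus the source update rate, and if that difference is not positive the migration never completes. The numbers below are illustrative only.

```python
# Back-of-the-envelope model of catch-up time during a zero downtime
# migration: backlog shrinks at (transfer rate - update rate).
def catch_up_hours(backlog_gb, transfer_gb_per_hour, update_gb_per_hour):
    """Time for the target to catch up while the source keeps changing."""
    net_rate = transfer_gb_per_hour - update_gb_per_hour
    if net_rate <= 0:
        return float("inf")   # transfer too slow: migration never completes
    return backlog_gb / net_rate

print(catch_up_hours(500, 60, 10))   # 10.0 hours
print(catch_up_hours(500, 10, 10))   # inf
```

This is why the text stresses that the transfer rate must be *significantly* faster than the update rate: as the two rates converge, time-to-migrate grows without bound.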
When the source database systems are quiesced and are not being modified, the migration might be faster because there are no changes to incorporate. In a homogeneous database migration, the time-to-migrate might be quite fast because you can use backup/restore or export/import functionality, and the transfer of files scales.
The cloud is changing the way applications are designed, including how data is processed and stored. Instead of a single general-purpose database that handles all of a solution's data, polyglot persistence solutions use multiple, specialized data stores, each optimized to provide specific capabilities. The perspective on data in the solution changes as a result. There are no longer multiple layers of business logic that read and write to a single data layer. Instead, solutions are designed around a data pipeline that describes how data flows through a solution, where it is processed, where it is stored, and how it is consumed by the next component in the pipeline.
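The pipeline idea above can be sketched as a routing step that sends each kind of data to the specialized store suited to it. The store names and routing rules are invented for illustration; a real solution would route to, say, a time-series database, a key-value cache, and a document database.

```python
# Hedged sketch of polyglot persistence: one routing function directs each
# piece of data to a specialized store instead of a single general-purpose
# database.
def route(event):
    """Decide which specialized store should receive this piece of data."""
    if event["type"] == "metric":
        return "time_series_store"
    if event["type"] == "session":
        return "key_value_store"
    return "document_store"

stores = {"time_series_store": [], "key_value_store": [], "document_store": []}
pipeline = [
    {"type": "metric", "cpu": 0.71},
    {"type": "session", "user": "u1"},
    {"type": "order", "items": 3},
]
for event in pipeline:
    stores[route(event)].append(event)   # next pipeline stage consumes these

counts = {name: len(items) for name, items in stores.items()}
print(counts)
```

Each store can then be optimized independently (retention for metrics, TTLs for sessions, indexing for documents), which is the payoff that motivates polyglot persistence.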
Big data solutions. A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems. The data may be processed in batch or in real time. Big data solutions typically involve a large amount of non-relational data, such as key-value data, JSON documents, or time-series data. Often, traditional RDBMSs are not well suited to storing this type of data. The term NoSQL refers to a family of databases designed to hold non-relational data. The term isn't quite accurate, because many non-relational data stores support SQL-compatible queries; NoSQL is better read as "Not only SQL".
The fourth edition of this classic textbook provides major updates. This edition has completely new chapters on Big Data Platforms (distributed storage systems, MapReduce, Spark, data stream processing, graph analytics) and on NoSQL, NewSQL, and polystore systems. It also includes an updated web data management chapter that includes RDF and semantic web discussion, and an integrated chapter on database integration focusing on both schema integration and querying over these systems. The peer-to-peer computing chapter has been updated with a discussion of blockchains. The chapters that describe classical distributed and parallel database technology have all been updated. The new edition covers the breadth and depth of the field from a modern viewpoint. Graduate students, as well as senior undergraduate students studying computer science and other related fields, will use this book as a primary textbook. Researchers working in computer science will also find this textbook useful. This textbook has a companion web site that includes background information on relational database fundamentals, query processing, transaction management, and computer networks for those who might need this background. The web site also includes all the figures and presentation slides as well as solutions to exercises (restricted to instructors).