Recently I have been using Database of Databases to look up information on TiDB, PolarDB, SequoiaDB, and OceanBase, and to make some simple comparisons. For example, the table below compares the basic information of these four databases.
The links are as follows:
- https://dbdb.io/db/tidb
- https://dbdb.io/db/polardb
- https://dbdb.io/db/sequoiadb
- https://dbdb.io/db/oceanbase
I actually think this database search engine could add a built-in side-by-side comparison feature like this in the future. (I did not feel like translating everything.)
Basic Information
Name | TiDB | PolarDB | SequoiaDB | OceanBase |
---|---|---|---|---|
Website | https://pingcap.com/en | https://www.alibabacloud.com/product/polardb | http://www.sequoiadb.com | https://oceanbase.com/ |
Source Code | https://github.com/pingcap/tidb | None | https://github.com/SequoiaDB/SequoiaDB | https://github.com/oceanbase/oceanbase |
Tech Docs | https://pingcap.com/docs | None | http://doc.sequoiadb.com/cn/sequoiadb | https://open.oceanbase.com/docs |
Developer | PingCAP | Alibaba | SequoiaDB Corporation | Alibaba |
Start Year | 2015 | 2017 | 2011 | 2013 |
Project Type | Open Source | Commercial | Commercial, Open Source | Commercial, Open Source |
Written in | Go | None | C++ | C++ |
Supported languages | C, C++, Cocoa, D, Eiffel, Erlang, Go, Haskell, Java, Lua, OCaml, Perl, PHP, Python, Ruby, Scheme, SQL, Tcl | None | C, C#, C++, Java, PHP, Python | C, C++, Java |
Compatible With | MySQL | MySQL, PostgreSQL | MySQL, PostgreSQL, Spark SQL | MySQL |
Operating Systems | Linux | Hosted | Linux | Hosted, Linux |
Licenses | Apache v2 | Proprietary | AGPL v3 | GPL v2 |
Design Comparison
Name | TiDB | PolarDB | SequoiaDB | OceanBase |
---|---|---|---|---|
Overview | TiDB is an open-source, distributed, scalable Hybrid Transactional and Analytical Processing (HTAP) database built by PingCAP. The goal of TiDB is to serve as a one-stop solution for both OLTP (Online Transactional Processing) and OLAP (Online Analytical Processing). | PolarDB is a commercial cloud-based relational database developed by Alibaba. It is designed for clients with high read demand and is compatible with two popular databases: MySQL and PostgreSQL. It has three layers: users interact with the database through the computing layer, PolarFS is the distributed file system, and PolarStore is the storage layer. PolarDB uses InnoDB as its storage engine. | SequoiaDB is a distributed relational database with a storage layer and a computing layer. The storage layer is a database storage engine that uses the Raft algorithm to achieve data consistency across distributed nodes. The computing layer consists of relational database instances, which can be a MySQL instance, a semi-structured data access interface (for example, JSON APIs), or an unstructured data storage interface (for example, AWS S3). Key features of SequoiaDB include distributed OLTP with availability and consistency guarantees, petabyte-level horizontal scalability, Hybrid Transactional/Analytical Processing (HTAP), and 2-region 3-data-center recovery mechanisms. | OceanBase is a distributed, scalable, shared-nothing relational DBMS developed by Alibaba. The goal of OceanBase is to serve financial scenarios that are demanding in performance, cost, and scalability, and that require high availability and strong consistency. It is designed and optimized for OLTP applications on relational structured data, though its shared-nothing architecture also supports OLAP applications. |
History | TiDB is inspired by the design of Google F1 and Google Spanner, and it supports features like horizontal scalability, strong consistency, and high availability. | PolarDB was first released in September 2017 and was officially commercialized in April 2018. At the same time, Alibaba Cloud gave a talk about PolarDB at the International Conference on Data Engineering (ICDE). | None | In 2010, OceanBase team leader Zhenkun Yang joined Alibaba. Facing rapidly growing concurrency in Alibaba's business and ever shorter development cycles for new transaction systems, Yang concluded that existing DBMSs could not support Alibaba's workloads, so he decided to abandon the traditional DBMS framework and develop a new DBMS from scratch. From the very beginning, he set three key principles for the new product: (1) distributed, (2) low cost, (3) high reliability. In 2013, Alipay decided to stop using Oracle. Since the alternative, MySQL, could not ensure strong consistency between the active and standby servers, OceanBase got its first opportunity; from then on, OceanBase was no longer open source. From 2014 to 2016, the team spent three years developing OceanBase 1.0, the first commercial DBMS to support distributed transactions. Starting in 2017, OceanBase began to serve external customers. In 2019, OceanBase beat Oracle and took first place in the TPC-C benchmark. |
Checkpoints | TiDB provides consistent checkpoints without blocking. Users can start a transaction and dump all the data from any table. TiDB also provides a way to read consistent data from historical versions: the tidb_snapshot system variable supports reading history data (see the sketch after this table). | All modifications before a checkpoint must have been applied to the data chunks; logs of changes committed after a checkpoint are also allowed to appear in it. During recovery, PolarDB chooses the newest checkpoint rather than the longest one. | Not Supported | OceanBase adopts a consistent, non-blocking checkpoint mechanism. The team is currently developing a fuzzy non-blocking checkpoint mechanism implemented on the LSM-tree structure, planned for release in the next version. |
Concurrency Control | Historical versions of data are kept because each update or removal creates a new version of the data object instead of updating or removing it in place. Not all versions are kept, however: versions older than a specific time are removed completely to reduce storage occupancy and the performance overhead caused by too many history versions. In TiDB, Garbage Collection (GC) runs periodically to remove these obsolete data versions. GC is triggered as follows: a gc_worker goroutine runs in the background of each TiDB server, and in a cluster with multiple TiDB servers one of the gc_worker goroutines is automatically elected leader. The leader is responsible for maintaining the GC state and sends GC commands to each TiKV region leader. | The primary (read-write) node and replica (read-only) nodes communicate through a message sender and an ack receiver, and each has its own buffer pool. A replica updates itself during runtime redo operations. PolarDB uses the difference between the written log sequence number and the replica's applied log sequence number as the replica lag, to keep track of the version each replica is holding. In PolarFS, the ParallelRaft protocol coordinates multiple data chunk servers. ParallelRaft is a consensus protocol derived from Raft, but it tolerates out-of-order I/O completion. | SequoiaDB implements transactions but does not support concurrency control options. Transactions are limited to record insert, delete, update, and query (read) operations. Other database operations, such as creating a new index, are not supported in transactions and will not be logged as part of the transaction. | OceanBase adopts MVCC for concurrency control. If an operation involves a single partition or multiple partitions on a single server node, it reads the snapshot of that server node. If the operation involves partitions on multiple server nodes, it executes a distributed snapshot read. |
Compression | None | None | SequoiaDB uses the BSON (binary JSON) format to encode and store JSON data. | OceanBase uses column compression. It implements several encoding algorithms and automatically chooses the most suitable one for each column. Column compression leverages data similarity, such as the same data type or the same value range. |
Data Model | TiDB uses TiKV as the underlying storage engine. TiKV uses a key-value model and can be seen as a huge, distributed, ordered map with high performance and reliability. | Relational | SequoiaDB supports relational, semi-structured (e.g., JSON), and unstructured (e.g., POSIX file) data models. The data model in storage is JSON. | OceanBase is a relational database. It supports the relational data model and is compatible with MySQL. |
Isolation Levels | TiDB uses the Percolator transaction model. A global read timestamp is obtained when a transaction starts, and a global commit timestamp is obtained when it commits; the execution order of transactions is determined by these timestamps. Repeatable Read is the default transaction isolation level in TiDB. | PolarDB maintains a read view, an array of read-write operations captured when a transaction starts. Replica nodes are read-only and therefore have no read-write operations of their own; the primary node sends an initial read view to a replica as part of the handshake, and it is updated during redo. | SequoiaDB supports three isolation levels: read uncommitted, read committed, and read stability. By default, the system is configured as read uncommitted. | OceanBase supports read committed since version 1.0 (the default isolation level), snapshot isolation since version 2.0, and serializable since version 2.2 (see the sketch after this table). |
Indexes | None | As in MySQL, the B+tree is the default index data structure, ordered by primary key. One optimization is that PolarDB records the location of the last insertion to speed up the next insertion. During parallel query execution, a PolarDB feature, the B+tree is partitioned across multiple workers; each worker sees only its own partition, and when a worker finishes one partition it automatically attaches to a new one. | Indexes in SequoiaDB use conventional B-trees. An index has a unique name on the data collection and a JSON object that defines the indexing criteria and direction. Indexes can be unique or non-unique; a unique index can be nullable or non-nullable. In addition to regular indexes, SequoiaDB supports full-text search via Elasticsearch. | The only available index structure in OceanBase is the B+tree. Because OceanBase splits tables into partitions, it supports both local indexes on individual partitions and global indexes across partitions (see the sketch after this table). OceanBase also supports secondary indexes, which combine the index keys with the table's primary key. |
Joins | TiDB's SQL layer currently supports three types of distributed join: hash join, sort-merge join (chosen when the optimizer thinks even the smallest table is too large to fit in memory and the predicates contain indexed columns), and index lookup join. | Each worker first scans and joins its own partition; PolarDB then merges each worker's join result and returns it to the client. | None | OceanBase supports three join algorithms: nested loop join, sort-merge join, and hash join. Sort-merge join and hash join only work for equi-joins, while nested loop join works under any join condition. For nested loop join, OceanBase supports both sequential scan and index scan on the inner table, and it also implements blocked nested loop join. |
Logging | TiDB uses the Raft consensus algorithm for replication, so it has a Raft log. TiDB also provides a binlog to export data from the TiDB cluster. | Physical logging | SequoiaDB implements logical logging and log replay to support data consistency across distributed replicas. | OceanBase uses physiological logging to record all modifications: each log record targets the modification of a single page without specifying the detailed data organization within the page. OceanBase uses the Paxos consensus algorithm to synchronize log replicas across server nodes. |
Query Interface | TiDB supports SQL and the MySQL dialect. | PolarDB supports standard SQL, just as MySQL does, and adds parallel query as a feature. Parallel query can be enabled in several ways: `set max_parallel_degree = n`, `set force_parallel_mode = on`, or per-query hints such as `SELECT /*+ PARALLEL() */ ... FROM ...` and `SELECT /*+ PARALLEL(n) */ ... FROM ...` (see the sketch after this table). | SQL, HTTP/REST, Command-line/Shell | OceanBase supports the standard SQL query interface, though there are slight differences in OceanBase's implementation; the detailed OceanBase SQL syntax documentation can be found in the citations. |
Storage Architecture | Any durable storage engine stores data on disk, and TiKV is no exception. But TiKV does not write data to disk directly; it stores data in RocksDB, and RocksDB is responsible for the actual storage. The reason is that it costs a lot to develop a standalone storage engine, especially a high-performance one. | PolarDB is disk-oriented, like MySQL. The primary and replica nodes have buffer pools and can access data and logs on shared storage. The primary has the right to flush pages during normal operation; after redo in recovery, replicas flush pages to disk, and the primary node, after receiving the read view from the new master, also writes pages to disk. | If the system is deployed on distributed clusters, it can be configured with either range-based or hash-based sharding. | OceanBase is a disk-oriented, distributed, shared-nothing DBMS. From the perspective of storage management, OceanBase is divided into multiple Zones, each a cluster of physical server nodes. Several Zones store replicas of the same partitions and synchronize logs using the Paxos distributed consensus algorithm. Each Zone has multiple server nodes named ObServers. OceanBase supports physical horizontal partitioning. There are two kinds of blocks for data storage: the Macro Block (2 MB), the smallest unit for write operations, and the Micro Block (16 KB before compression), the smallest unit for read operations. From the perspective of resource management, each database instance is considered a tenant in OceanBase. Every tenant is allocated a unit pool containing units; each unit is a group of computation and storage resources on an ObServer, and each tenant has at most one unit per ObServer. OceanBase implements a block cache for Micro Blocks to accelerate large scan queries, and a row cache for rows in the block cache to accelerate small get queries. The storage data structure of OceanBase is based on the LSM-tree, similar to the approach of LevelDB. Data modifications are first recorded in the MemTable (dynamic data in memory) using a linked list, whose head is linked to the corresponding block in the block cache. During the low-peak period at night, or when the size of the MemTable reaches a threshold, OceanBase merges the MemTable into the SSTable (static data on disk) using one of the following merge algorithms: (1) Major Compaction: read all the static data from disk, merge it with the dynamic data, and write it back to disk as new static data. This is the most expensive algorithm and is typically used after DDL operations. (2) Minor Compaction: reuse all the Macro Blocks that are not dirty. This is the default algorithm OceanBase adopts. (3) Alternate Compaction: when one ObServer is about to compact a partition, queries on that partition are sent to ObServers in other Zones storing replicas of the same partition; after compaction, the merged Zone warms its cache. OceanBase adopts this algorithm when data has to be merged during peak periods; it is orthogonal to minor and major compaction and is used in combination with one of them. (4) Dump: dump the MemTable to disk as a Minor SSTable and merge it with the previously dumped Minor SSTable. When the size of the Minor SSTable exceeds a threshold, OceanBase merges it into the SSTable using one of the aforementioned compaction algorithms. This lightweight approach is used when the dynamic data is significantly smaller than the static data. |
Storage Model | TiDB stores its data in the distributed key-value storage engine TiKV. | None | SequoiaDB uses a JSON-format record as the unit of data storage. Records are stored in collections, and collections are stored in "collection spaces". | In its first released version, OceanBase only supported the N-ary Storage Model, i.e., the DBMS stores all attributes of a single tuple contiguously in a page. Since OceanBase 2.0, it supports a hybrid storage model: attributes belonging to the same tuple are stored in the same block, but the tuples within a block are compressed and stored in a columnar fashion. |
Stored Procedures | Not Supported | None | SequoiaDB supports creating and executing custom procedures via a JavaScript-like syntax in the SequoiaDB shell. | Since OceanBase 2.0, stored procedures written in SQL are supported. Stored procedures are an extension of SQL in relational DBMSs; they allow much more complicated queries by organizing data manipulation and multiple simple queries into procedural code. |
System Architecture | A TiDB cluster has three components: the TiDB server, the PD server, and the TiKV server. The TiDB server is stateless; it does not store data and is for computing only. It is horizontally scalable and provides a unified interface to the outside through load-balancing components such as Linux Virtual Server (LVS), HAProxy, or F5. The Placement Driver (PD) server is the managing component of the entire cluster. The TiKV server is responsible for storing data; from an external view, TiKV is a distributed transactional key-value storage engine. The Region is the basic unit for storing data: each Region stores the data for a particular key range, a left-closed, right-open interval from StartKey to EndKey, and there are multiple Regions on each TiKV node. TiKV uses the Raft protocol for replication to ensure data consistency and disaster recovery; the replicas of the same Region on different nodes form a Raft group. Load balancing of data among TiKV nodes is scheduled by PD, with the Region as the basic scheduling unit. | Within its shared-disk design, PolarDB has multiple data chunk servers, each holding chunks of data on its own standalone non-volatile memory SSD. Compute nodes (database servers) read and write to the disks via remote direct memory access (RDMA). | The system can be deployed on a single node or on multiple distributed nodes. In a distributed environment, coordination nodes and catalog nodes share disk space, while storage nodes are given separate disk space; the storage nodes can be configured either to share a disk or to each have a non-shared disk. | OceanBase adopts a shared-nothing system architecture. It stores a replica of each partition on at least three server nodes in different server clusters. Each server node has its own SQL execution engine and storage engine; the storage engine only accesses the local data on that node, while the SQL engine accesses the global schema and generates distributed query plans. Query executors visit the storage engine of each node to distribute and gather data among them when executing a query. For each database instance, one server node is set as the active root server to provide root service, such as monitoring the health of all participant nodes; the root service is responsible for load balancing, data consistency, error recovery, etc. If the active root server goes down, OceanBase automatically promotes a standby root server to become the new active root server. |
Views | TiDB 3.0 provides experimental support for views. Views in TiDB are non-materialized: when a view is queried, TiDB internally rewrites the query to combine the view definition with the SQL query (see the sketch after this table). | None | None | OceanBase supports materialized views. OceanBase is a commercial DBMS mainly serving corporate clients with large-scale data storage and high QPS, so it implements materialized views to increase throughput and reduce latency, thereby reducing the number of servers needed (saving hardware cost). |
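
A small illustration of the TiDB history reads mentioned in the Checkpoints row. This is a minimal sketch: the table `orders` and the timestamp are hypothetical, and only the `tidb_snapshot` variable itself comes from the table above.

```sql
-- Read the latest data (hypothetical table `orders`).
SELECT * FROM orders WHERE id = 1;

-- Point the session at a historical snapshot; subsequent reads see the data
-- as of that moment, provided GC has not yet removed those versions.
SET @@tidb_snapshot = "2024-01-01 12:00:00";
SELECT * FROM orders WHERE id = 1;

-- Clear the variable to go back to reading the latest data.
SET @@tidb_snapshot = "";
```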
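
For the Isolation Levels row: since TiDB and OceanBase both speak the MySQL dialect, their isolation levels can be switched with standard MySQL syntax. A sketch against a hypothetical `orders` table:

```sql
-- TiDB defaults to REPEATABLE READ; OceanBase defaults to READ COMMITTED.
-- Either can be changed per session with the standard MySQL statement.
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

START TRANSACTION;
SELECT * FROM orders WHERE id = 1;  -- hypothetical table
COMMIT;
```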
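
For the local and global indexes mentioned in the Indexes row, a sketch of what index creation on a partitioned table could look like in OceanBase's MySQL mode. The table and column names are made up, and the GLOBAL/LOCAL keywords should be checked against the OceanBase documentation for the version in use.

```sql
-- Hypothetical partitioned table.
CREATE TABLE orders (
    id BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    amount DECIMAL(10, 2),
    PRIMARY KEY (id)
) PARTITION BY HASH(id) PARTITIONS 8;

-- A local index is maintained separately within each partition.
CREATE INDEX idx_orders_amount ON orders (amount) LOCAL;

-- A global index spans all partitions of the table.
CREATE INDEX idx_orders_customer ON orders (customer_id) GLOBAL;
```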
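
The PolarDB parallel-query switches listed in the Query Interface row, written out as complete statements. This is a sketch that assumes the parameter names from that row; the table `orders` is hypothetical, and exact names may vary across PolarDB versions.

```sql
-- Session-level switches (names as listed in the Query Interface row).
SET max_parallel_degree = 8;
SET force_parallel_mode = ON;

-- Per-query optimizer hints, optionally with an explicit degree of parallelism.
SELECT /*+ PARALLEL() */ COUNT(*) FROM orders;
SELECT /*+ PARALLEL(8) */ customer_id, SUM(amount) FROM orders GROUP BY customer_id;
```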
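
For the non-materialized views mentioned in the Views row, a minimal sketch against a hypothetical `orders` table:

```sql
-- Define a non-materialized view over a hypothetical table.
CREATE VIEW big_orders AS
    SELECT id, customer_id, amount
    FROM orders
    WHERE amount > 1000;

-- Queries on the view are rewritten into the underlying SELECT at query time.
SELECT COUNT(*) FROM big_orders;
```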