YDB meets TPC-C: distributed transactions performance now revealed

Evgeniy Ivanov
Published in YDB.tech blog
Sep 27, 2023 · 9 min read

In our previous post on performance, dedicated to the Yahoo! Cloud Serving Benchmark (YCSB), we hinted that more benchmarks were coming. We are on track with that goal, and today we are excited to present our first results for TPC-C*, the industry-standard On-Line Transaction Processing (OLTP) benchmark. According to these results, there are scenarios in which YDB slightly outperforms CockroachDB, another trusted and well-known distributed SQL database.

For those unfamiliar with TPC-C, this post gives a brief overview of the benchmark. Then we discuss some interesting details behind TPC-C implementation for YDB. This includes notes on Benchbase from the CMU Database Group — the renowned research group at Carnegie Mellon University led by Prof. Andy Pavlo — as well as a fork of Benchbase undertaken by our colleagues at YugabyteDB. We propose several major improvements to these implementations, which significantly reduce CPU/RAM consumption and the time needed to import the initial TPC-C data.

Finally, we present the TPC-C results, comparing YDB to CockroachDB. This comparison should be particularly interesting for distributed database enthusiasts: YDB and CockroachDB are both distributed database management systems (and TPC-C makes extensive use of distributed transactions), but they are based on different paradigms. YDB was initially inspired by Calvin and has since evolved beyond it, which deserves a separate blog post. Meanwhile, CockroachDB achieves consensus via the Raft protocol, an approach also chosen by YugabyteDB, another Google Spanner derivative. There are blog posts comparing these approaches against each other: in part 1 and part 2, Prof. Daniel J. Abadi, the lead inventor of the Calvin protocol, focuses on the weaknesses of Spanner and the advantages of Calvin. In response, YugabyteDB claimed that Calvin derivatives suffer from a limited transactional model (NoSQL only), high latency, and low throughput. However, let us assure you: YDB is a distributed SQL relational database. We now weigh in on this largely theoretical dispute with real performance data from our TPC-C runs.

TPC-C 101

TPC-C is a standard that describes the benchmark without any official implementation. It was approved in 1992, right after the end of the dinosaur era and the first Linux release, but still before the invention of the iPhone, Android, Tesla, SpaceX, and even Windows 95. It remains relevant, popular, and in active use. Without any doubt, TPC-C is “the only objective comparison for evaluating OLTP performance” (quote from CockroachDB).

TPC-C simulates a retail company with a configurable number of warehouses, each serving 10 districts with 3,000 customers per district. A corresponding inventory exists for the warehouses. Customers place orders composed of several items. The company, of course, tracks payments, deliveries, and the history of orders.

The benchmark defines the following nine tables:

  • Warehouse
  • District
  • Customer
  • Order
  • New-Order
  • History
  • Item
  • Stock
  • Order-Line

The relationships between the tables and their relative cardinalities are shown in the diagram below, taken from the standard.

Each district has a terminal, and all transactions are executed through these terminals. The transactions are:

  • New-Order
  • Payment
  • Order-Status
  • Delivery
  • Stock-Level

Each terminal can execute only one transaction at a time. It chooses the next transaction type according to fixed probabilities: 0.45 for New-Order, 0.43 for Payment, and 0.04 each for Order-Status, Delivery, and Stock-Level. Each transaction type also has its own keying and think times, which in practice cap the maximum throughput of a terminal.
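
To make this concrete, here is a minimal sketch of such a terminal loop in Java. The class is hypothetical rather than Benchbase code, and the keying and think-time constants are placeholders, not the values mandated by the standard:

import java.util.concurrent.ThreadLocalRandom;

// Minimal sketch of a TPC-C terminal loop (hypothetical, not Benchbase code).
enum TxType { NEW_ORDER, PAYMENT, ORDER_STATUS, DELIVERY, STOCK_LEVEL }

class Terminal implements Runnable {
    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            TxType tx = nextTxType();
            try {
                sleepSeconds(keyingTimeSeconds(tx)); // operator "types in" the input
                execute(tx);                         // synchronous request to the DBMS
                sleepSeconds(thinkTimeSeconds(tx));  // operator "reads" the output
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Transaction mix from the standard: 45% New-Order, 43% Payment,
    // 4% each for Order-Status, Delivery and Stock-Level.
    private TxType nextTxType() {
        double p = ThreadLocalRandom.current().nextDouble();
        if (p < 0.45) return TxType.NEW_ORDER;
        if (p < 0.88) return TxType.PAYMENT;
        if (p < 0.92) return TxType.ORDER_STATUS;
        if (p < 0.96) return TxType.DELIVERY;
        return TxType.STOCK_LEVEL;
    }

    // Placeholder values; the standard prescribes per-transaction timings.
    private int keyingTimeSeconds(TxType tx) { return 2; }
    private int thinkTimeSeconds(TxType tx)  { return 10; }

    private void execute(TxType tx) { /* run the SQL for this transaction */ }

    private static void sleepSeconds(int s) throws InterruptedException {
        Thread.sleep(s * 1000L);
    }
}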

As stated by the standard:

The performance metric reported by TPC-C is a “business throughput” measuring the number of orders processed per minute. Multiple transactions are used to simulate the business activity of processing an order, and each transaction is subject to a response time constraint. The performance metric for this benchmark is expressed in transactions-per-minute-C (tpmC).

Note that although tpmC counts only New-Order transactions, it is an integral metric that characterizes the execution of all transaction types.

A very wise idea in the TPC-C design is that it caps the transaction rate of each terminal. Thus, to increase the load, you must increase the configurable number of warehouses, which in turn increases the number of terminals. However, more warehouses also means a larger dataset: each warehouse requires about 85 MiB of storage. A single warehouse can contribute at most 12.86 tpmC. That is why TPC-C implementations usually report both the measured tpmC and the efficiency, i.e. the ratio of the measured tpmC to the maximum possible tpmC.
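
As a quick illustration of this arithmetic, here is a tiny self-contained Java snippet; the measured tpmC value in it is made up:

// Efficiency is the ratio of measured tpmC to the theoretical maximum
// of 12.86 tpmC per warehouse. The measured value below is made up.
public class Efficiency {
    static final double MAX_TPMC_PER_WAREHOUSE = 12.86;

    static double efficiencyPercent(double measuredTpmc, int warehouses) {
        return 100.0 * measuredTpmc / (MAX_TPMC_PER_WAREHOUSE * warehouses);
    }

    public static void main(String[] args) {
        // 1,000 warehouses can deliver at most 12,860 tpmC;
        // a measured 12,600 tpmC therefore corresponds to ~98% efficiency.
        System.out.printf("%.1f%%%n", efficiencyPercent(12_600, 1_000));
    }
}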

TPC-C implementation

As already mentioned, there is no official TPC-C implementation. The only well-known generic implementation of TPC-C is part of the Benchbase project. This project supports a wide range of benchmarks and is designed to be easily extended with new benchmarks and database management systems. While some vendors have contributed their support to the original Benchbase, others, like YugabyteDB, have cloned it and tailored it to their specific needs. Our original intention was to contribute YDB support to Benchbase, but we chose to clone and adjust it for the reasons described below.

Benchbase is written in Java 17. Each terminal is represented by a thread, and transactions are performed as synchronous requests to the database. Per the TPC-C requirements, terminals are inactive most of the time, which Benchbase emulates by putting threads to sleep. Many sleeping threads are not an issue until you have too many of them; also, each thread requires its own stack.

The default stack size for Java threads (specifically, ThreadStackSize) is 1 MiB on x86_64 and 2 MiB on aarch64. Recall that each warehouse has 10 terminals. So, if you aim to manage 15,000 warehouses, you will need 150,000 threads and approximately 150 GiB of virtual memory. Resident memory, though, depends on actual stack usage and paging; with huge pages, memory consumption becomes tremendous. Even if we assume only 64 KiB per thread, that still adds up to almost 10 GiB.

Besides that, Benchbase stores in memory various details about each executed transaction, such as the transaction type, start time, latency, worker ID, and phase ID. At first glance, this might seem like a mere 9 bytes per transaction, ignoring data alignment. However, Benchbase pre-allocates 500,000 such records per thread, which amounts to approximately 4 MiB per terminal thread and 40 MiB per warehouse. For 15,000 warehouses you would have to allocate about 600 GiB right from the start. And given that TPC-C mandates at least two hours of execution (not counting any warmup), you end up collecting a substantial amount of latency data anyway.
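
For the curious, the back-of-the-envelope arithmetic behind these estimates looks roughly like this (an illustration only, not Benchbase code):

// Back-of-the-envelope arithmetic behind the memory estimates above.
public class MemoryEstimate {
    public static void main(String[] args) {
        final int warehouses = 15_000;
        final int terminalsPerWarehouse = 10;
        final long threads = (long) warehouses * terminalsPerWarehouse; // 150,000

        // Platform threads: ~1 MiB of stack per thread on x86_64 by default.
        double stackGiB = threads * 1.0 / 1024.0;
        System.out.printf("Thread stacks (virtual memory): ~%.0f GiB%n", stackGiB);

        // Latency records: ~9 bytes each, 500,000 pre-allocated per thread.
        double recordsGiB = threads * 500_000.0 * 9.0 / (1L << 30);
        System.out.printf("Pre-allocated latency records:  ~%.0f GiB%n", recordsGiB);
    }
}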

Due to these factors, we were unable to launch a benchmark instance with more than 3,000 warehouses on a machine with 512 GiB of RAM (around 170 MiB per warehouse, which is excessive)! We are not alone in facing this issue; YugabyteDB has encountered similar challenges. According to their documentation, "for 10k warehouses, you would need ten clients of type c5.4xlarge to drive the benchmark". The c5.4xlarge has 32 GiB of memory, so by this recommendation you would need 320 GiB in total, or roughly 33 MiB per warehouse, which aligns closely with our own estimate. That makes the number of machines required to drive TPC-C comparable to the number of machines running the database itself, and conducting such benchmarks in the cloud could become prohibitively expensive.

We solved the memory issue in two simple steps:

1. We opted not to store the full latency history, choosing to retain only a latency histogram instead.

2. We switched to Java 20 and employed virtual threads, a feature that has now been finalized in Java 21.

These adjustments reduced the memory requirement to 6 MiB per warehouse. It is a noticeable improvement, though there still seems to be room for further optimization.
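
Below is a minimal sketch of what these two changes look like, assuming Java 21; the class name and the histogram bucket layout are illustrative, not our actual code:

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch of the two changes: virtual threads instead of platform threads,
// and a fixed-size latency histogram instead of a per-transaction history.
public class VirtualTerminals {
    // One bucket per millisecond up to 10 s; the last bucket catches outliers.
    static final AtomicLongArray latencyHistogram = new AtomicLongArray(10_001);

    static void recordLatency(long millis) {
        int bucket = (int) Math.min(millis, 10_000);
        latencyHistogram.incrementAndGet(bucket);
    }

    public static void main(String[] args) {
        int terminals = 15_000 * 10; // 10 terminals per warehouse
        // Virtual threads are cheap: 150,000 of them do not need 150 GiB of stacks.
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < terminals; i++) {
                pool.submit(() -> {
                    long start = System.nanoTime();
                    // ... execute one TPC-C transaction here ...
                    recordLatency(Duration.ofNanos(System.nanoTime() - start).toMillis());
                });
            }
        } // close() waits for the submitted tasks to finish
    }
}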

Another challenge we faced was concerningly high CPU consumption during the initial data import. We traced the issue to the util.RandomGenerator shared by all the loader threads. Altering the code to use a thread-local random generator dramatically cut CPU consumption from hundreds of cores to just a few. We have submitted a pull request to Benchbase with this fix.
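
The essence of the fix, in simplified form (an illustration of the idea, not the actual patch):

import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

// Simplified illustration of the contention problem and the fix.
public class LoaderRandom {
    // Before: one Random instance shared by all loader threads.
    // Random is thread-safe, but its seed is updated with a CAS loop,
    // so under heavy contention threads burn CPU retrying.
    static final Random SHARED = new Random();

    static int sharedNextInt(int bound) {
        return SHARED.nextInt(bound);
    }

    // After: each thread uses its own generator, with no shared state at all.
    static int threadLocalNextInt(int bound) {
        return ThreadLocalRandom.current().nextInt(bound);
    }
}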

Despite this improvement, the initial data import remained problematic for us. YDB offers three methods for adding data: INSERT, UPSERT, and BULK UPSERT. The latter is significantly faster and is designed specifically for importing user data. Unfortunately, Benchbase lacks the flexibility to accommodate BULK UPSERT without modifications to its generic codebase.

We also discovered a bug in the Delivery transaction implementation, a flaw present both in the original Benchbase (issue) and in the YugabyteDB fork (issue). After fixing this bug in the YDB TPC-C implementation, we observed a 4% decrease in tpmC.

Benchmark setup

This time our performance testing environment is a bare-metal cluster of three machines, each with the following configuration:

  • Two 32-core Intel Xeon Gold 6338 processors @ 2.00 GHz with hyper-threading enabled (128 logical cores)
  • 512 GiB of RAM
  • 4x Intel SSDPE2KE032T80 NVMe drives
  • 50 Gbps network
  • Transparent hugepages enabled
  • Ubuntu 20.04.3 LTS

According to our tests, the Intel SSDPE2KE032T80 NVMe drives have the following performance characteristics:

We tested the DBMS versions and configurations listed below.

YDB 23-3 (1aea989). Each machine runs one storage node (pinned to 32 cores with taskset) and six compute nodes (each pinned to its own 16 cores). You can find the full configuration here, as well as the installation script.

CockroachDB 23.1.10. We run one CockroachDB instance per SSD, each pinned to 32 cores with taskset and configured with a 37.5 GiB cache and 37.5 GiB of SQL memory. We apply this recommended configuration; here is the configuration used by our install scripts. Note that we don't use partitioning, which is an enterprise feature: there are real-world scenarios where data cannot be partitioned the way it can in TPC-C.

Unfortunately, due to this issue, we couldn't include YugabyteDB in this round. We hope they resolve the problem soon, so we can engage in comparative testing once again.

We ran TPC-C for 12 hours on a separate machine, which also had 128 logical cores and 512 GiB of RAM. We utilized our own benchhelpers scripts for YDB. We used the following commands for CockroachDB:

./cockroach workload fixtures import tpcc --warehouses=12000 'postgres://root@host:26257?sslmode=disable'

sleep 30m

./cockroach workload run tpcc --warehouses=12000 --ramp=30m --duration=12h `cat ../cockroach_addr`

Results

We noticed that above 12,000 warehouses CockroachDB often fails with one of the following errors:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 1261.0s        0           36.2           95.2     92.3    125.8   7516.2   7516.2 delivery
 1261.0s        0            3.0          939.2   4563.4   9126.8   9126.8   9126.8 newOrder
 1261.0s        0           43.2           95.5     12.1     26.2     29.4     29.4 orderStatus
 1261.0s        0          314.3          956.6     26.2     35.7    159.4   1140.9 payment
 1261.0s        0           33.1           95.5     28.3     88.1    167.8    167.8 stockLevel
Error: error in delivery: ERROR: inbox communication error: rpc error: code = Canceled desc = context canceled (SQLSTATE 58C01)

or

  108.8s        0            1.0          250.5   4563.4   4563.4   4563.4   4563.4 delivery
  108.8s        0           17.9         2478.8  10200.5  10737.4  10737.4  10737.4 newOrder
  108.8s        0            4.0          254.1   5368.7   5905.6   5905.6   5905.6 orderStatus
  108.8s        0            5.0         2509.0   5368.7  10737.4  10737.4  10737.4 payment
  108.8s        0            3.0          258.1   4563.4   5637.1   5637.1   5637.1 stockLevel
Error: error in delivery: ERROR: rpc error: code = Unavailable desc = error reading from server: read tcp [2a02:6b8:c34:14:0:1354:eb1f:2ca6]:37234->[2a02:6b8:c34:14:0:1354:eb1f:2962]:26258: use of closed network connection (SQLSTATE XXUUU)

At the same time, we observed transaction latencies of up to 10 seconds in both the benchmark output and the metrics. We consulted the CockroachDB team via Slack, and they confirmed that these delays were due to cluster overload. As a result, the maximum number of warehouses we managed to run with well-repeatable results was 12,000 (with 97.8% efficiency).

In the case of YDB, we noticed performance degradation beyond 13,000 warehouses. This is why the results presented in this post are based on a configuration of 13,000 warehouses (with 99.4% efficiency).

Here are the resulting tpmC values, where higher is better. YDB outperformed CockroachDB by 5.6%.

Below are the transaction latency percentiles, measured in milliseconds (lower is better). There is a difference in precision between the YDB and CockroachDB TPC-C implementations, but overall the latencies look similar.

Conclusion

In this post, we dived into TPC-C benchmarking, the most important yardstick for evaluating OLTP performance. We introduced our own TPC-C implementation and spotlighted some of the challenges and limitations we encountered with existing frameworks like Benchbase.

According to our initial results, we succeeded in outperforming our colleagues at CockroachDB on this three-node bare-metal setup. However, this is just the beginning. There are numerous configurations still to explore and software updates that could influence results. So, stay tuned for more in-depth analyses and head-to-head comparisons.

* The results are not officially recognized TPC results and are not comparable with other TPC-C test results published on the TPC website.


Developer at YDB, passionate about performance. Interested in databases and distributed systems. Opinions are my own. Follow me on Twitter: @eivanov89