PostgreSQL's road to the most popular database

2024年1月17日水曜日

Introducing a new community wiki page "Operations cheat sheet"

This is a short announcement of a new PostgreSQL wiki page called Operations cheat sheet . I curated many useful pages (that I find so), mostly from blogs registered in Planet PostgreSQL . Also, I summarized some topics for quick reference. I created this page so that users can learn from rich wisdom PostgreSQL community people and get ready for troubleshooting. For details, please see the Introduction section.

You can reach there from the PostgreSQL Wiki top page by following "General articles and guides" -> "Operations cheat sheet."

I hope it will help, and you, as community members, will further gather community wisdom. I'd really appreciate your cooperation. Thanks.

2021年3月26日金曜日

Faster Bulk Insertion to Foreign Tables - Introduction to PostgreSQL 14 Committed Features

Introduction

Currently, the global PostgreSQL development team, led by Amit Kapila, continues to contribute to the community as part of Fujitsu's work on PostgreSQL. As a member of this team, I am actively working with talented and passionate community members around the world to advance PostgreSQL.

Our team will publish blog posts focusing on the features and patches that members have worked on. This article discusses "Batch Insert to Foreign Table" committed to PostgreSQL 14, explaining why we've worked on it and the resulting exciting performance improvements.

Why did we work on this?

Our team is focused on improving PostgreSQL's scalability and performance. One big goal is to achieve read-write scale-out. With this, we want to expand the workloads to which PostgreSQL can be applied.

The idea of using postgres_fdw, which is based on the foreign data wrapper (FDW) mechanism, is considered in the community as a promising and practical idea for interactions between multiple servers for scale-out. This is because postgres_fdw has been improved over the years by a number of developers.

At such times, we saw a committer Tomas Vondra try to speed up bulk insertion with postgres_fdw. He says that customers often report slow insertions into databases sharded with FDW.

A user reported a similar problem to the PostgreSQL bug report mailing list. He reported that it took an hour and 50 minutes to insert 20 million rows with an INSERT SELECT statement into a database sharded using postgres_fdw on the Google Cloud. It's 14 times slower than what he got with a non-sharded database, which only took 8 minutes.

The cause of the slowness is the latency of the network and the number of communications. In the current FDW mechanism, the foreign data wrapper inserts rows into the foreign table row by row. Each time, a round trip communication occurs with a remote server.

Tomas Vondra and other people were exploring how to modify only postgres_fdw to send multiple rows together to a foreign server. But while they were looking for a way to avoid changing the executor, development stopped for about three months.

What features have I developed?

I made the following proposal. The idea of sending multiple rows at once to a foreign server is the same as Tomas Vondra's.

Add functions for batch row insertion to the FDW interface: BeginForeignBatchInsert, ExecForeignBatchInsert, EndForeignBatchInsert, GetForeignBatchSize.
ExecForeignBatchInsert () takes an array of rows and the number of rows and sends them all together to the foreign server.
Executor accumulates multiple rows and passes them together to the FDW. The number of rows to accumulate is obtained from the FDW with GetForeignBatchSize ().
postgres_fdw uses the row array passed in to assemble "INSERT ... VALUES (row 1), (row 2),..., (row n)" statement and execute it on the foreign server.
Add batch_size option to the postgres_fdw's configuration of foreign server and foreign table. This specifies the number of rows to be inserted together for each foreign table.

It's really straightforward and simple, isn't it?

This speeds up INSERT VALUES statements with multiple rows, as well as INSERT SELECT statements. You do not need to change the application.

The above interfaces should be readily implemented in FDWs for other databases, such as oracle_fdw and mysql_fdw.

How much faster?

We first measured the data insertion to a simple table. The target table has one primary key column of type int and one column of type text.

The time it took for an INSERT SELECT statement to insert 1 million rows from another table into that table is as follows. The postgres_fdw batch_size parameter is set to 100. This means sending up to 100 rows at a time to the foreign server:

Local table without FDW: 6.1 seconds
Remote table with FDW (before improvement): 125.3 seconds
Remote table with FDW (after improvement): 11.1 seconds

Great! It's 11 times faster.

Next, split the target table of the same columns into eight hash partitions. Similarly, the time it took to insert 1 million rows with an INSERT SELECT statement is:

Local table without FDW: 8.6 seconds
Remote table with FDW (before improvement): 113.7 seconds
Remote table with FDW (after improvement): 12.5 seconds

This is also good! It's nine times faster.

Note that for this measurement, the foreign server was located on the same host. The network latency measured by "ping localhost "is 34 microseconds. Given the significant performance gains seen with such low latency, this feature will be even more effective in environments with higher network latency, such as the cloud.

Expectations for higher speeds

This feature makes adding data to a single external table dramatically faster. However, this is faster sequential processing using a single CPU.

On the other hand, my colleague Greg Nancarrow is working on making it possible to use multiple CPUs to process INSERT SELECT statements in parallel. Other developers are trying to parallelize the CREATE TABLE AS SELECT (CTAS) statement.

By combining these, I expect to see the potential for even faster data migration to sharded OLTP databases and ETL processing to data warehouses. I am looking forward to delivering good news again.

Acknowledgments

I'd like to express to Tomas Vondra, Amit Langote and other developers my deepest thanks for getting this feature committed relatively quickly. Thanks to their quick and enthusiastic feedback, I believe that we were able to make this feature a reality.

2016年4月18日月曜日

Join the survey to expand PostgreSQL ecosystem

Great features, but less popularity

Thanks to the long and hard effort by the community, PostgreSQL has been evolving to be a really great software comparable to existing strong commercial products. According to the following article, open source databases are gaining more popularity enough to influence the database market.

Open source threatens to eat the database market

[Excerpt]
"Though the proprietary RDBMS market grew at a sluggish 5.4 percent in 2014, the open source database market grew 31 percent to hit $562 million."
"As Gartner highlights in a recent research report, open source databases now consume 25 percent of relational database usage."

On the other hand, there is a gulf between the two top popular databases -- Oracle and MySQL -- and PostgreSQL. They are nearly five times more popular than PostgreSQL.

DB-Engines Ranking

Yes, I understand this ranking doesn't necessarily reflect the actual use, but I also don't think the ranking is far from the real popularity. In fact, some surveys show that MySQL has been in more widespread use even here in Japan than PostgreSQL since around 2010 (IIRC).

Problem: interoperability

What should be done to boost the popularity of PostgreSQL? One challenge is to increase the number of software which supports PostgreSQL. To take advantage of the trend of shift from commercial products to open source, PostgreSQL needs to interoperate with many software that are used together with the commercial databases.

The easily understandable target is Oracle, because it is anticipated that more users of Oracle will seek another database to avoid the expensive Oracle Standard Edition 2 and increasing maintenance costs. In addition, PostgreSQL has affinity for Oracle.

However, there is a problem. The number of software is very small that the users can know to interoperate with PostgreSQL. That is, when the users want to migrate from commercial databases to PostgreSQL, they can't get information on whether they can continue to use their assets with PostgreSQL. Many applications might be interoperable through standard interfaces like JDBC/ODBC, but the case is unknown. For example:

Only 24 open source projects are listed as interoperable.
Open Source Projects Using PostgreSQL
About 120 products are listed in the Software Catalog (excluding the ones like PostgreSQL extensions). But the number would drop down to about 100 if the members of the same product family are counted as one.
Even EnterpriseDB has only 12 certified application vendors .
PostgreSQL Enterprise Consortium lists only about 30 related products (Japanese only) .
MySQL touts more than 2,000 ISV/OEM/VARs .

Besides, in practice, more development efforts will probably be necessary to make PostgreSQL more interoperable with many software products. e.g. one customer asked me whether Arcserve can be used to back up PostgreSQL databases, but unfortunately I had to answer no. They are using Arcserve to back up Oracle databases and other resources. "Then, you can use NetVault instead" is not the best answer; they just want to replace the database.

Challenge: improvement and disclosure of interoperability with software products

At the steering committee of PostgreSQL Enterprise Consortium (PGECons), I proposed starting the following activity. PGECons is a Japanese non-profit organization to promote PostgreSQL for enterprise use. The members include NTT, SRA OSS, Inc. (Tatsuo Ishii runs), NEC, Hitachi, HP, Fujitsu (my company), etc. The committee concluded that we need to work in cooperation with the PostgreSQL community.

Attract and ask product/service vendors to support/use PostgreSQL. Provide technical assistance to those vendors as an organization so that they can support PostgreSQL smoothly. If the vendors aren't proactive, we verify the interoperability with their software by executing it.
Make a directory of software/services that can be used with PostgreSQL on the community web site (wiki.postgresql.org or www.postgresql.org). Software/services vendors and PostgreSQL developers/users can edit this directory. This list not only has the names of software and its vendors, but also other information such as the level of interoperability (certified by the vendor, or verified by the community/users) and remarks about configuration, tuning, and whatever useful for users.

Take the ecosystem survey

I consulted the PostgreSQL community on how to proceed, and decided to solicit opinions from (potential) users. There are so many software products that we can't address them all. Thus, we'd like to know what commercial/open-source software products are required to become (better) interoperable with PostgreSQL.

By filling out the following survey form, please tell us what software you need in order to adopt PostgreSQL in wider range. The form comprises three questions: software/usage/industry category, a specific software, and a free comment on the interoperability issue of the software. Your request will encourage us to improve the interoperability of PostgreSQL to make more software available.

Survey form
https://docs.google.com/forms/d/1Ueq4Z9MYJLv-gCp_UrPXENdSH6jXtqR0zz5F1W3Qqvo/viewform

You can see the survey result in real time in the spreadsheets below.

Survey response sheets
https://docs.google.com/spreadsheets/d/1sxd--agLVrFbn7M3YCqFTR812MrBExiyAM_4qVf9trk/pubhtml

We really appreciate your cooperation. Thanks.