KINTO Tech Blog
General

Investigating the Cause of a Deadlock in Aurora MySQL 3 series in a Production Environment

Cover Image for Investigating the Cause of a Deadlock in Aurora MySQL 3 series in a Production Environment

Introduction

Hello! I'm Nakatani (@tksnakatani) from the Common Service Development Group, where we plan and develop essential functions across web and mobile app services, including membership and payment platforms.

In this article, I'll share a case of an Aurora MySQL deadlock that occurred in a production environment at a payment platform—one of those "close-call incidents" that many of us have likely encountered at least once.

The Incident Leading to the Deadlock

One day in 2024, we received an incident notification from our log monitoring system. When I reviewed the notification, I found that the following error log recorded during a credit card payment process.

Deadlock found when trying to get lock; try restarting transaction

Additionally, Slack alerted us to an inquiry from a product manager reporting that their credit card payment had failed. At that moment, I immediately sensed the gravity of the situation, and I still vividly remember breaking out into a cold sweat.

Cause investigation

Logic check

The deadlock itself was resolved naturally after about 30 minutes. Upon further investigation, we discovered that a popular product had been released around the time the error occurred, leading to a high volume of purchase requests.

A deadlock generally occurs when multiple transactions hold resources required by each other, causing a standstill. Given that load testing had been conducted to simulate such scenarios, it was puzzling what the deadlock still occurred.

Initially, we couldn't pinpoint any specific processes that would cause resource contention, so it was theoretically difficult to identify the rootcause of the deadlock.

Reproduction check

Next, we tried to reproduce the requests to the API, both before and after the deadlock occurred, in a local environment.

Using the same request parameters that caused the problem in the production environment, we sent the following two requests almost simultaneously using the curl command. As in the production environment, one request was successful, but the other resulted in a system error.

Here's an example of the curl command we used:

curl --location 'http://localhost:8080/payments/cards' \
--header 'Content-Type: application/json' \
--data '{
    "amount": 10,
    "capture": false,
    "request_id": "ITEM-20240101-0000001"
}


curl --location 'http://localhost:8080/payments/cards' \
--header 'Content-Type: application/json' \
--data '{
    "amount": 10,
    "capture": false,
    "request_id": "ITEM-20240101-0000002"
}

:::

The error log contained the following message:

Deadlock found when trying to get lock; try restarting transaction

was output.

Being able to reproduce the problem provided a crucial clue for our investigation, and for the moment, we felt a sense of relief.

SHOW ENGINE INNODB STATUS

Additionally, we used the MySQL SHOW ENGINE INNODB STATUS command to check the status of the InnoDB storage engine.

The SHOW ENGINE INNODB STATUS command provides comprehensive information about the state of the InnoDB storage engine. This information can provide clues to examine the lock and transaction details and identify the specific cause of the deadlock.

mysql> set GLOBAL innodb_status_output=ON;
mysql> set GLOBAL innodb_status_output_locks=ON;

・・・Again, sent two requests, using the curl command.

mysql> SHOW ENGINE INNODB STATUS;

The results back then are as follows: *Some portions have been excerpted and masked.

=====================================
2024-xx-xx 10:05:27 0x7fe300290700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 2 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 463 srv_active, 0 srv_shutdown, 7176 srv_idle
srv_master_thread log flush and writes: 0
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 318
OS WAIT ARRAY INFO: signal count 440
RW-shared spins 290, rounds 306, OS waits 16
RW-excl spins 1768, rounds 5746, OS waits 48
RW-sx spins 0, rounds 0, OS waits 0
Spin rounds per wait: 1.06 RW-shared, 3.25 RW-excl, 0.00 RW-sx
------------------------
LATEST DETECTED DEADLOCK
------------------------
2024-04-18 10:04:02 0x7fe3059a4700
*** (1) TRANSACTION:
TRANSACTION 12085, ACTIVE 6 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 7 lock struct(s), heap size 1136, 3 row lock(s), undo log entries 3
MySQL thread id 70, OS thread handle 140612935517952, query id 28138 192.168.65.1 user update
insert into payments (....
*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 297 page no 5 n bits 248 index uq_payments_01 of table `payment`.`payments` trx id 12085 lock_mode X locks gap before rec
Record lock, heap no 56 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 11; hex 6d65726368616e745f3031; asc merchant_01;;
 1: len 7; hex 5041594d454e54; asc PAYMENT;;
 2: len 30; hex 6276346c6178316736367175737868757676647963356737656c616a6466; asc bv4lax1g66qusxhuvvdyc5g7elajdf; (total 32 bytes);
 3: len 30; hex 70615f666f79706161656c6a71666f663378746332366b6d61756c38676e; asc pa_foypaaeljqfof3xtc26kmaul8gn; (total 35 bytes);
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 297 page no 5 n bits 248 index uq_payments_01 of table `payment`.`payments` trx id 12085 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 56 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 11; hex 6d65726368616e745f3031; asc merchant_01;;
 1: len 7; hex 5041594d454e54; asc PAYMENT;;
 2: len 30; hex 6276346c6178316736367175737868757676647963356737656c616a6466; asc bv4lax1g66qusxhuvvdyc5g7elajdf; (total 32 bytes);
 3: len 30; hex 70615f666f79706161656c6a71666f663378746332366b6d61756c38676e; asc pa_foypaaeljqfof3xtc26kmaul8gn; (total 35 bytes);
*** (2) TRANSACTION:
TRANSACTION 12084, ACTIVE 7 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 7 lock struct(s), heap size 1136, 3 row lock(s), undo log entries 3
MySQL thread id 69, OS thread handle 140612935812864, query id 28139 192.168.65.1 user update
insert into payments (....
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 297 page no 5 n bits 248 index uq_payments_01 of table `payment`.`payments` trx id 12084 lock_mode X locks gap before rec
Record lock, heap no 56 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 11; hex 6d65726368616e745f3031; asc merchant_01;;
 1: len 7; hex 5041594d454e54; asc PAYMENT;;
 2: len 30; hex 6276346c6178316736367175737868757676647963356737656c616a6466; asc bv4lax1g66qusxhuvvdyc5g7elajdf; (total 32 bytes);
 3: len 30; hex 70615f666f79706161656c6a71666f663378746332366b6d61756c38676e; asc pa_foypaaeljqfof3xtc26kmaul8gn; (total 35 bytes);
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 297 page no 5 n bits 248 index uq_payments_01 of table `payment`.`payments` trx id 12084 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 56 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 11; hex 6d65726368616e745f3031; asc merchant_01;;
 1: len 7; hex 5041594d454e54; asc PAYMENT;;
 2: len 30; hex 6276346c6178316736367175737868757676647963356737656c616a6466; asc bv4lax1g66qusxhuvvdyc5g7elajdf; (total 32 bytes);
 3: len 30; hex 70615f666f79706161656c6a71666f663378746332366b6d61756c38676e; asc pa_foypaaeljqfof3xtc26kmaul8gn; (total 35 bytes);
*** WE ROLL BACK TRANSACTION (2)


----------------------------
END OF INNODB MONITOR OUTPUT
============================

The above indicates the following:

  • TRANSACTION 12085 and TRANSACTION 12084 exist.
  • Both TRANSACTION 12085 and TRANSACTION 12084 acquired the same "gap lock.”
  • TRANSACTION 12085 attempted to acquire an "insert intention gap lock" before insertion, but this conflicted with the gap lock of TRANSACTION 12084, causing it to wait.
  • TRANSACTION 12084 attempted to acquire an "insert intention gap lock" before insertion, but this conflicted with the gap lock of TRANSACTION 12085, causing it to wait.
  • MySQL detected a deadlock and rolled back TRANSACTION 12084.

What are gap lock and insertion intention gap lock?

Gap lock

A gap lock is a lock on the gap between index records, or a lock on the gap before the first index record or after the last index record. For example, SELECT c1 FROM t WHERE c1 BETWEEN 10 and 20 FOR UPDATE; prevents other transactions from inserting a value of 15 into column t.c1, because the gaps between all existing values in the range are locked, regardless of whether such a value already exists in the column.

https://dev.mysql.com/doc/refman/8.0/ja/innodb-locking.html#innodb-gap-locks

Insertion intention gap lock

An insertion intention lock is a type of gap lock set by an INSERT operation before a row is inserted. This lock indicates that the insertion is done intentionally so that multiple transactions inserted into the same index gap do not have to wait for each other unless they are inserted into the same place within that gap. Let's assume that there are index records whose values are 4 and 7. Separate transactions attempting to insert the values 5 and 6, respectively, use an insertion intention lock to lock the gap between 4 and 7 before acquiring an exclusive lock on the inserted rows, but do not block each other because the rows do not conflict.

https://dev.mysql.com/doc/refman/8.0/ja/innodb-locking.html#innodb-insert-intention-locks

Since it was discovered that the gap lock was the cause of the problem in question, we proceeded to identify the portion of the credit-card payment processing where the gap lock was acquired.

The overall flow of the payment processing is divided into the following three steps:

  1. Checking whether a payment has been made with the same request ID
  2. Requesting payment from the payment service provider
  3. Writing the results from the payment service provider into the database and returning a response.

We set breakpoints around the areas where SQL is issued and performed debugging in a local environment, confirming that a gap lock was acquired immediately after executing the following query:

SELECT * FROM PAYMENTS where request_id = '' FOR UPDATE;

The data from performance_schema.data_locks at that time is as follows: Situation of the remote meeting on the day

Cause

All the information has been gathered, and the cause has been identified.

In the payment platform, the request_id was received from the request source to check for duplicate requests, and this value was assigned a unique index for use in subsequent references.

Meanwhile, on the product side, the request_id was generated according to the following rule:

Product ID-YYYYMMDD-Sequential Number

When the deadlock occurred, a popular product had been released, and purchase requests for the same product were concentrated and sent in large numbers over a short period of time. As a result, a large number of requests were sent, in which the sequential number portion of request_id was counted up rapidly.

An example of a curl command

curl --location 'http://localhost:8080/payments/cards' \
--header 'Content-Type: application/json' \
--data '{
    "amount": 10,
    "capture": false,
    "request_id": "ITEM-20240101-0000001"
}

:::

As mentioned above, the main flow of credit card payment processing is as follows.

  1. Checking whether a payment has been made with the same request ID
  2. Requesting payment from the payment service provider
  3. Writing the results from the payment service provider into the database and returning a response.

The problem occurred in the first step:

SELECT * FROM PAYMENTS where request_id = '' FOR UPDATE;

This query is typically executed under the assumption that requests with the same request_id will not be received. However, since the data for the request_id had not been inserted yet, the query resulted in no action, and a gap lock was acquired.

Subsequently, in the third step, INSERT processing to write the results occurred, and it attempted to acquire an insertion intention gap lock. However, it conflicted with the gap lock that had already been acquired, resulting in a wait. As a result, MySQL detected a deadlock, and one transaction was rolled back.

Resolution

Through our investigation, we identified that the SELECT FROM ... FOR UPDATE query, which was used to check for duplicate payments, was causing the deadlock. To resolve this, we decided to discontinue this query and revised the design. Now, the data is temporarily registered when a request is received, and the transaction is committed immediately.

Although there were concerns that increasing the frequency of commits might lead to performance degradation, load testing confirmed that the necessary performance could be maintained, so we released it with this specification.

Reflection

This incident highlighted our insufficient understanding of gap locks. We didn't fully understand that a gap lock is acquired when there are no results for a SELECT FROM ... FOR UPDATE query. While I believed I carefully reviewed manuals and incorporated the ideas into our designs, I now realize that I had assumed I knew everything. https://dev.mysql.com/doc/refman/8.0/ja/innodb-locking.html

I also regret that there were aspects we could have noticed during testing.

In our load testing, we used random values (UUIDs) for the request_id, primarily to avoid performance degradation from index fragmentation and rebuilding. As a result, no deadlock occurred during the tests, and they completed successfully.

Conclusion

A deep understanding of transaction behavior and lock mechanisms is essential when working with MySQL and the InnoDB storage engine. This incident reinforced the importance of regularly reviewing documentation and specifications, and seeking expert advice when necessary.

Additionally, I learned that investigating the actual parameters used in the production environment, and conducting tests with request values based on those parameters, can help identify issues early and improve overall quality.

Facebook

関連記事 | Related Posts

We are hiring!

【Woven City決済プラットフォーム構築 PoC担当バックエンドエンジニア(シニアクラス)】/Toyota Woven City Payment Solution開発G/東京

Toyota Woven City Payment Solution開発グループについて私たちのグループはトヨタグループが取り組むWoven Cityプロジェクトの一部として、街の中で利用される決済システムの構築を行います。Woven Cityは未来の生活を実験するためのテストコースとしての街です。

【DBRE】DBRE G/東京・名古屋・大阪

DBREグループについてKINTO テクノロジーズにおける DBRE は横断組織です。自分たちのアウトプットがビジネスに反映されることによって価値提供されます。

イベント情報

P3NFEST Bug Bounty 2025 Winter 【KINTOテクノロジーズ協賛】