Managing Long-Running Jobs in Spring Batch: Troubleshooting Connection Issues

Dealing with long-running jobs in Spring Batch can be tricky, especially when it comes to managing database connections and transactions. In our latest post on Developer’s Coffee, we dive into a real-world issue we faced with a Spring Batch job that timed out due to connection limits. Discover how we tackled the problem by tweaking HikariCP settings and adjusting load balancer timeouts to ensure smooth and reliable batch processing. Read on to learn more about our solutions and how you can apply them to your projects.

Introduction

Spring Batch is a powerful framework for processing large volumes of data. However, it can pose challenges when dealing with long-running jobs, especially regarding database connections and transaction management. In this post, we’ll explore a real-world issue encountered during a long-running batch job and how we resolved it.

The Problem

We had a Spring Batch job that included several steps, one of which took a considerable amount of time to execute. The job utilized HikariCP for connection pooling, with a max-lifetime setting of 30 minutes. The database in use was YugabyteDB.

Here’s a snippet of our job configuration:

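// Chunk-oriented step: items are read and written in chunks, one transaction per chunk.
return new StepBuilder("step1", jobRepository)
        .<UUID, UUID>chunk(employeeTmpProcessorPageSize, transactionManager)
        .reader(employeeTmpReader)
        .writer(employeeTmpWriter)
        // Skip up to step1SkipLimit items that fail with a RuntimeException.
        .faultTolerant()
        .skipLimit(step1SkipLimit)
        .skip(RuntimeException.class)
        // Allow the step to run again on restart even if it already completed.
        .allowStartIfComplete(true)
        .listener(new SPItemProcessorListener())
        // Process chunks concurrently on the supplied executor.
        .taskExecutor(taskExecutor)
        .build();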

During execution, if step1 took longer than 30 minutes, we encountered the following exception:

org.springframework.dao.DataAccessResourceFailureException: PreparedStatementCallback; SQL [UPDATE ...]; An I/O error occurred while sending to the backend.; nested exception is com.yugabyte.util.PSQLException: An I/O error occurred while sending to the backend.
...
Caused by: java.sql.SQLException: Connection is closed
    at com.zaxxer.hikari.pool.ProxyConnection$ClosedConnection.lambda$getClosedConnection$0(ProxyConnection.java:502)
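
When triaging a trace like this one, it can help to walk the exception's cause chain and look for the pool-level "Connection is closed" message underneath the Spring wrapper. The helper below is a minimal sketch that uses only the JDK. The class name ClosedConnectionDetector and the message match are assumptions made for illustration; neither is part of Spring Batch or HikariCP.

```java
import java.sql.SQLException;

/** Hypothetical triage helper; not part of Spring Batch or HikariCP. */
final class ClosedConnectionDetector {

    private ClosedConnectionDetector() {
    }

    /**
     * Returns true if any exception in the cause chain is a SQLException
     * whose message contains "Connection is closed".
     */
    static boolean causedByClosedConnection(Throwable error) {
        int depth = 0;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (Throwable t = error; t != null && depth < 50; t = t.getCause(), depth++) {
            if (t instanceof SQLException
                    && t.getMessage() != null
                    && t.getMessage().contains("Connection is closed")) {
                return true;
            }
        }
        return false;
    }
}
```

A skip or retry listener could call causedByClosedConnection(exception) to separate connection drops from genuine data errors, so that infrastructure failures do not silently count against the skip limit.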

Root Cause Analysis (RCA)

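1. Connection Lifetime: HikariCP's max-lifetime was set to 30 minutes. This setting caps the total age of a pooled connection, not its idle time (idle-timeout controls that), so the pool was retiring connections on a 30-minute cycle, which matched the point at which long-running steps began to fail. HikariCP documents that a connection in use is only retired once it is returned to the pool, so this setting is unlikely to be the whole story on its own.

2. Database Load Balancer Timeout: The load balancer in front of the database had an idle timeout shorter than the step's running time (e.g., 30 minutes), so connections that stayed quiet for that long were dropped in the middle of long-running transactions.

3. Transactional Context: Spring Batch uses pooled connections inside its transactions to update job metadata in the job repository. When one of those connections was closed, the metadata updates failed.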

Solution

To address the issue, we implemented the following changes:

1. Increase HikariCP max-lifetime: We raised this setting from 30 to 90 minutes so that connections remain open for the duration of long-running steps. However, this was a temporary measure.

# 90 minutes, in milliseconds (in a .properties file a comment must sit on its own line)
spring.datasource.hikari.max-lifetime=5400000
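
HikariCP's time-based settings are expressed in milliseconds, so it is easy to gain or lose a zero when converting by hand. The following sketch makes the arithmetic explicit with java.time.Duration. The class name HikariMillis and its property method are hypothetical and exist only for this example.

```java
import java.time.Duration;

/** Hypothetical helper that renders a duration as a millisecond-valued property line. */
final class HikariMillis {

    private HikariMillis() {
    }

    /** Formats "key=<milliseconds>" for a .properties file. */
    static String property(String key, Duration value) {
        return key + "=" + value.toMillis();
    }
}
```

For example, HikariMillis.property("spring.datasource.hikari.max-lifetime", Duration.ofMinutes(90)) returns spring.datasource.hikari.max-lifetime=5400000, since 90 × 60 × 1000 = 5,400,000.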

2. Adjust Load Balancer Timeout: We increased the idle timeout setting on the Azure load balancer to 60 minutes to align with our expected job execution times.

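3. Periodic Connection Validation: We adjusted the pool's validation and idle settings so that stale connections are detected and replaced rather than handed to a long-running step. validation-timeout bounds how long HikariCP waits for a connection to pass its aliveness check, and idle-timeout controls how long an unused connection may sit in the pool before it is retired (it only takes effect when minimum-idle is lower than maximum-pool-size).

# 30 seconds
spring.datasource.hikari.validation-timeout=30000
# 60 seconds
spring.datasource.hikari.idle-timeout=60000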
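
The settings above interact with each other and with the load balancer, so a quick consistency check can catch obvious mismatches before a job runs. The sketch below is a rule of thumb under stated assumptions, not HikariCP's own validation logic. The class TimeoutPlan, its fields, and the two rules it checks (idle connections are retired before the load balancer would drop them, and idle-timeout is shorter than max-lifetime) are illustrative choices.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check; the rules are rules of thumb, not HikariCP behavior. */
final class TimeoutPlan {
    private final Duration maxLifetime;
    private final Duration idleTimeout;
    private final Duration loadBalancerIdleTimeout;

    TimeoutPlan(Duration maxLifetime, Duration idleTimeout, Duration loadBalancerIdleTimeout) {
        this.maxLifetime = maxLifetime;
        this.idleTimeout = idleTimeout;
        this.loadBalancerIdleTimeout = loadBalancerIdleTimeout;
    }

    /** Returns one message per violated rule; an empty list means no mismatch was found. */
    List<String> warnings() {
        List<String> warnings = new ArrayList<>();
        // Assumed rule 1: the pool should retire idle connections before the
        // load balancer silently drops them.
        if (idleTimeout.compareTo(loadBalancerIdleTimeout) >= 0) {
            warnings.add("idle-timeout should be shorter than the load balancer idle timeout");
        }
        // Assumed rule 2: an idle timeout at or above max-lifetime can never trigger.
        if (idleTimeout.compareTo(maxLifetime) >= 0) {
            warnings.add("idle-timeout should be shorter than max-lifetime");
        }
        return warnings;
    }
}
```

With the values from this post (max-lifetime of 90 minutes, idle-timeout of 60 seconds, load balancer idle timeout of 60 minutes), warnings() returns an empty list. The check only covers connections sitting idle in the pool; a connection held by a step that sends no traffic for longer than the load balancer's idle timeout may still be affected.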

Result

After implementing these changes, the long-running batch job executed successfully without connection timeout issues. The job repository was able to maintain and update the job execution context, ensuring data consistency and successful job completion.

Conclusion

Handling long-running jobs in Spring Batch requires careful consideration of database connection settings and transaction management. By increasing connection lifetimes and aligning load balancer settings with job requirements, we can ensure robust and reliable batch processing.

Stay tuned for more insights and solutions to common development challenges at Developer’s Coffee!

Feel free to share your experiences and solutions to similar issues in the comments below. Happy coding!

References:

https://medium.com/@office.yeon/spring-batch-connection-timeout-correlation-between-mysql-and-hikaricp-d27e4112c9c3

https://stackoverflow.com/questions/77243253/spring-batch-connection-closed-when-long-time-step-execution
