Implementing Batch Jobs with Hibernate

By Thorben Janssen


Like most Java developers, you probably use Hibernate directly or via Spring Data JPA to implement your application’s persistence layer. This works very well for most use cases, but it causes some issues if you need to implement a batch job.

This is because Hibernate, like most object-relational mapping frameworks, maps each database record to an entity object. It processes each one as an independent element with its own lifecycle and its own database operations. This creates an overhead if you want to implement a batch job that works on thousands of records.

At the same time, most batch jobs only use a portion of the benefits provided by Hibernate’s object-relational mapping. As useful as the flexible fetching and handling of complex graphs of entities often are, they’re not suited for mass operations. The number of executed SQL statements and the size of the object graphs would cause severe performance problems.

Because of that, my first recommendation is to implement the batch job as a separate service. That enables you to use a different persistence technology, e.g., jOOQ, that avoids the object-relational mapping overhead and might be better suited for your batch job. Within your existing application, where you often process records one by one and enjoy the benefits of Hibernate’s mapping, you can, of course, keep using Hibernate.

If you can’t implement your batch job as a separate service, you need to keep the overhead as small as possible. There are a few things that you can do to avoid the common challenges of batch jobs and to improve the performance of Hibernate.

Improve the Performance of Batch Jobs

Most batch jobs perform read and write operations, and you need to optimize both. Let’s talk about the read operations first.

Optimize Read Operations

Read operations in a batch job are no different from read operations in any other part of your application. That means you can apply the same principles and tools you already use in the rest of your application.

Pick the Right Projection

The first thing you should do is make sure that you use the right projection for each query. Entities are only a good fit for write operations. If you don’t change the retrieved information, you should use a DTO projection instead. DTO projections provide better performance than entities and enable you to load only the attributes you need in your business code. You can implement them in different ways. The easiest one is to use a constructor expression in your JPQL query.

// The constructor expression creates one BookPublisherValue per selected row
List<BookPublisherValue> bookPublisherValues = em.createQuery(
        "SELECT new org.thoughts.on.java.model.BookPublisherValue(b.title, b.publisher.name) FROM Book b",
        BookPublisherValue.class).getResultList();
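The constructor expression only works if the DTO class offers a constructor whose parameters match the selected values in number, order, and type. The article doesn’t show the class, so the following is only a minimal sketch of what BookPublisherValue might look like. The field and getter names are assumptions, and in a real project the class would be public and live in the package named in the query.

```java
// Hypothetical DTO for the constructor expression above.
// Assumption: in a real project this is a public class in the
// package org.thoughts.on.java.model, as referenced by the query.
final class BookPublisherValue {

    private final String title;
    private final String publisherName;

    // Parameter order and types have to match the SELECT new ... clause:
    // (b.title, b.publisher.name)
    public BookPublisherValue(String title, String publisherName) {
        this.title = title;
        this.publisherName = publisherName;
    }

    public String getTitle() {
        return title;
    }

    public String getPublisherName() {
        return publisherName;
    }
}
```

Because the class is a plain, immutable Java object and not a managed entity, the persistence context doesn’t need to track it or run dirty checks on it.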

Fetch Entities Efficiently

When fetching entity objects to change or remove them, you should use as few queries as possible to get the entity object itself and all the required associations.

That means you use 1 query to get a List of the entity objects you need to change or remove. This might sound like obvious advice, but I often see batch jobs that use a different approach in my consulting projects.

The job first gets a List of all the ids of the records that need to be changed. In the next step, the Java code then iterates through this List and gets each entity object using the EntityManager.find method. By doing that, Hibernate executes a query for each record you want to retrieve. These are often hundreds or thousands of unnecessary SQL statements that slow down your application.

After you have ensured that you read all required entities in 1 query, you need to optimize the initialization of the required associations. The best and easiest way to initialize the associations is to use a JPQL query to load your entity with a JOIN FETCH clause for each required association.

// JOIN FETCH initializes the books association in the same query
List<Author> authors = em.createQuery(
        "SELECT DISTINCT a FROM Author a JOIN FETCH a.books b",
        Author.class).getResultList();

Activate JDBC Batching

When you insert, update, or delete entities, Hibernate always processes a lifecycle state change and executes one SQL statement for each of them. This often causes lots of identical SQL statements that get executed with different bind parameters in a batch environment.

To execute them more efficiently, you can activate JDBC batching. It’s a JDBC feature that we discuss in great detail in the Hibernate Performance Tuning Online Training, and you can easily use it with Hibernate. It groups multiple consecutive, identical statements into one batch. Your application then sends 1 statement and numerous sets of bind parameter values to the database for each batch.

16:03:57,856 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,856 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,856 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,856 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,856 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,856 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,856 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,856 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,856 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,856 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,856 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,856 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,856 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,857 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,857 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,857 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,857 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,857 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,857 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,857 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,857 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,857 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,857 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,857 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,858 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,858 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,862 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,862 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,862 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,862 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,862 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,862 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,862 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,862 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,863 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,863 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,863 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,863 DEBUG AbstractBatchImpl:130 - Reusing batch statement
16:03:57,863 DEBUG SQL:128 - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
16:03:57,863 DEBUG BatchingBatch:384 - Executing batch size: 20

The database then executes the statement for each set of bind parameters. This reduces the number of database roundtrips and enables your database to prepare the statement once and reuse it for each bind parameter set.
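Hibernate handles all of this for you once batching is activated. Purely to illustrate the underlying JDBC mechanism, here is a minimal plain JDBC sketch that uses PreparedStatement.addBatch and executeBatch. The class name, the method signature, and the way ids are assigned are assumptions made for this example and are not part of Hibernate.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class JdbcBatchSketch {

    // Same statement text as in the log output above
    private static final String INSERT_AUTHOR =
            "insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)";

    /**
     * Inserts one row per {firstName, lastName} pair and returns the number of batches sent.
     * Assumption for this sketch: ids are simply counted up from firstId.
     */
    static int insertAuthors(Connection con, List<String[]> names, long firstId, int batchSize)
            throws SQLException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batchesSent = 0;
        // The statement is prepared once and reused for every set of bind parameters
        try (PreparedStatement ps = con.prepareStatement(INSERT_AUTHOR)) {
            int pending = 0;
            long id = firstId;
            for (String[] name : names) {
                ps.setString(1, name[0]);
                ps.setString(2, name[1]);
                ps.setInt(3, 0);
                ps.setLong(4, id++);
                ps.addBatch(); // queue this set of bind parameters
                if (++pending == batchSize) {
                    ps.executeBatch(); // send the full batch to the database
                    batchesSent++;
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // send the remaining, partially filled batch
                batchesSent++;
            }
        }
        return batchesSent;
    }
}
```

With 45 names and a batch size of 20, this method sends 3 batches (20, 20, and 5 parameter sets) instead of issuing 45 separate statement executions.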

To activate JDBC batching, you only need to configure the batch’s maximum size in your persistence.xml.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<persistence>
    <persistence-unit name="my-persistence-unit">
        ...
        <properties>
            <property name="hibernate.jdbc.batch_size" value="20"/>
            ...
        </properties>
    </persistence-unit>
</persistence>

Order Your Batch Statements

A JDBC batch gets executed when it contains the configured maximum number of statements or when the next statement differs from the ones already in the batch. Due to that, the order in which you execute your statements has a huge impact on your JDBC batches’ efficiency.
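To get a feeling for how much the order matters, you can model this rule in a few lines of plain Java. The following sketch is a simplified model and not Hibernate’s actual implementation. The class, the method, and the placeholder statement texts are made up for this illustration.

```java
import java.util.List;

final class BatchCountSketch {

    /**
     * Counts how many batches get sent if a batch ends as soon as it is full
     * or the SQL text of the next statement differs from the current one.
     */
    static int countBatches(List<String> statements, int maxSize) {
        int batches = 0;
        int currentSize = 0;
        String currentSql = null;
        for (String sql : statements) {
            boolean statementChanged = currentSql != null && !currentSql.equals(sql);
            if (currentSize > 0 && (statementChanged || currentSize == maxSize)) {
                batches++; // the current batch is sent before the next statement is queued
                currentSize = 0;
            }
            currentSql = sql;
            currentSize++;
        }
        if (currentSize > 0) {
            batches++; // the last, possibly partially filled batch
        }
        return batches;
    }

    public static void main(String[] args) {
        String a = "insert into Author ..."; // placeholder statement text
        String b = "insert into Book ...";   // placeholder statement text

        // Alternating statements: every change of the statement ends the batch
        System.out.println(countBatches(List.of(a, b, a, b, a, b), 20)); // prints 6

        // The same statements grouped by type: one batch per statement type
        System.out.println(countBatches(List.of(a, a, a, b, b, b), 20)); // prints 2
    }
}
```

In this model, the alternating sequence gains nothing from batching, while the grouped sequence needs only one batch per statement type.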

But don’t worry, you don’t need to perform your operations in a specific order to ensure that Hibernate generates and executes the SQL statements in the right order. Due to JPA’s lifecycle model and various internal optimizations, you couldn’t control that order from your business code anyway. The only thing you need to do is activate the ordering of all SQL INSERT and UPDATE statements by setting the properties hibernate.order_inserts and hibernate.order_updates to true.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<persistence>
    <persistence-unit name="my-persistence-unit">
        ...
        <properties>
            <property name="hibernate.jdbc.batch_size" value="20"/>
            <property name="hibernate.order_inserts" value="true"/>
            <property name="hibernate.order_updates" value="true"/>
            ...
        </properties>
    </persistence-unit>
</persistence>

Hibernate then orders the statements internally. This ensures that all identical statements are executed one after another and can be efficiently grouped into batches.

Clean Up Your PersistenceContext

My final recommendation to improve your batch job’s performance is to monitor the number of operations performed per second. Especially in older Hibernate versions, you often see that it degrades over time.

One of the reasons for that can be the number of entities managed by the PersistenceContext. The more entities it has to manage, the more memory it consumes, the longer it takes to check if an entity object is already managed or needs to be fetched from the database, and the slower your dirty checks get. To avoid that, you might consider flushing and clearing your PersistenceContext at regular intervals.

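for (int i = 1; i <= 22000; i++) {
    Author a = new Author();
    a.setFirstName("FirstName" + i);
    a.setLastName("LastName" + i);
    em.persist(a);

    // After every 5000 entities: write the pending statements to the database
    // and detach all managed entities from the PersistenceContext
    if (i % 5000 == 0) {
        em.flush();
        em.clear();
    }
}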

To get the ideal size of that interval, you need to monitor and analyze your application’s performance. It highly depends on your Hibernate version, the complexity of your entity classes, and the available amount of memory.
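If you want to try different intervals, it can help to separate the chunking logic from the persistence code. The following helper is only a sketch in plain Java. ChunkedProcessor and forEachInChunks are hypothetical names and not part of Hibernate or JPA. In a Hibernate-based job, you would pass a call to em.persist as the action and a callback that calls em.flush() and em.clear() as afterChunk.

```java
import java.util.List;
import java.util.function.Consumer;

final class ChunkedProcessor {

    private ChunkedProcessor() {
    }

    /**
     * Applies action to every item and runs afterChunk each time chunkSize items
     * have been processed. Returns how often afterChunk was executed.
     */
    static <T> int forEachInChunks(List<T> items, int chunkSize,
                                   Consumer<? super T> action, Runnable afterChunk) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int processed = 0;
        int completedChunks = 0;
        for (T item : items) {
            action.accept(item);
            processed++;
            if (processed % chunkSize == 0) {
                afterChunk.run(); // e.g. flush and clear the PersistenceContext
                completedChunks++;
            }
        }
        return completedChunks;
    }
}
```

With 22,000 items and a chunk size of 5,000, afterChunk runs 4 times. As in the loop above, the remaining 2,000 entities stay in the PersistenceContext until the next flush, for example when the transaction commits.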

Conclusion

You can use object-relational mapping frameworks to implement batch jobs, but they are often not the best fit. The mapping creates an overhead compared to plain JDBC, and most batch jobs don’t benefit a lot from the upsides these mappings provide.

If you decide to implement your batch job using Hibernate, you need to pay special attention to the optimization of read and write operations.

If you’ve been using Hibernate for a while, you are already familiar with the optimization of read operations. You should always make sure that you use the right projection and fetch your entity associations efficiently.

Small configuration changes, like activating JDBC batching and the ordering of statements, can reduce the downsides of Hibernate’s record-centric SQL statements and its overall handling. And as long as you monitor the size of your persistence context and how it affects the performance, you will be able to implement an efficient and fast batch job.


About the author

Thorben is an independent consultant, international speaker, and trainer specialized in solving Java persistence problems with JPA and Hibernate.
He is also the author of Amazon’s bestselling book Hibernate Tips - More than 70 solutions to common Hibernate problems.
