Spring Batch: dump a set of queries over a database in parallel to flat files
My scenario, drilled down to its essence, is as follows: I have a config
file containing a set of SQL queries whose result sets need to be exported
as CSV files. Since some queries may return billions of rows, and because
something may interrupt the process (a bug, a crash, ...), I want to use a
framework such as Spring Batch, which gives me restartability and job
monitoring. I am using a file-based H2 database for persisting Spring
Batch jobs.
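For reference, this is roughly how I build the H2-backed JobRepository (a
sketch; the file path and credentials are placeholders, and the Spring
Batch metadata tables have to exist already, e.g. created from the
schema-h2.sql script shipped in spring-batch-core):

import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.repository.support.JobRepositoryFactoryBean;
import org.springframework.jdbc.datasource.DataSourceTransactionManager;
import org.springframework.jdbc.datasource.DriverManagerDataSource;
import org.springframework.transaction.PlatformTransactionManager;

// File-based H2 database holding the Spring Batch metadata tables.
DriverManagerDataSource batchDataSource = new DriverManagerDataSource();
batchDataSource.setDriverClassName("org.h2.Driver");
batchDataSource.setUrl("jdbc:h2:file:./batch-metadata"); // placeholder path
batchDataSource.setUsername("sa");
batchDataSource.setPassword("");

PlatformTransactionManager transactionManager =
        new DataSourceTransactionManager(batchDataSource);

JobRepositoryFactoryBean repositoryFactory = new JobRepositoryFactoryBean();
repositoryFactory.setDataSource(batchDataSource);
repositoryFactory.setTransactionManager(transactionManager);
repositoryFactory.afterPropertiesSet();
JobRepository jobRepository = (JobRepository) repositoryFactory.getObject();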
So, here are my questions:
Upon creating a Job, I need to provide my RowMapper with some initial
configuration. So what happens when a job needs to be restarted after,
e.g., a crash? Concretely:
1. Is the state of the RowMapper automatically persisted, so that upon
restart Spring Batch will try to restore the object from its database, or
2. will the RowMapper object from the original Spring Batch XML config
file be used, or
3. do I have to maintain the RowMapper's state myself using the
step's/job's ExecutionContext (rough sketch below)?
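To make option 3 concrete, this is how I imagine keeping the state around
myself: as far as I can tell, anything registered as an ItemStream gets
called back with the step's ExecutionContext, which is what gets persisted
and handed back on restart. A rough sketch (the key and the state field
are made up):

import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStream;
import org.springframework.batch.item.ItemStreamException;

// Sketch: a holder that saves/restores the RowMapper's configuration
// through the step's ExecutionContext. Key and field are illustrative.
public class RowMapperStateHolder implements ItemStream {

    private static final String STATE_KEY = "rowMapper.config";
    private String config; // whatever state my RowMapper needs

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        // On a restart, the context contains the last committed state.
        if (executionContext.containsKey(STATE_KEY)) {
            config = executionContext.getString(STATE_KEY);
        }
    }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException {
        // Called before every chunk commit; this is what gets persisted.
        executionContext.putString(STATE_KEY, config);
    }

    @Override
    public void close() throws ItemStreamException {
        // nothing to release in this sketch
    }
}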
The above question is related to whether there is any magic going on when
using the Spring Batch XML configuration, or whether I could just as well
create all these beans programmatically: since I need to parse my own
config format into a Spring Batch job config, I would rather just use
Spring Batch's Java classes (beans) and fill them in appropriately than
attempt to write out valid XML by hand. However, if my Job crashes, I
would create all the beans myself again. Does Spring Batch automagically
restore the Job state from its database?
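My working assumption (part of what I want confirmed) is that restart
state lives entirely in the JobRepository, keyed by job name plus
identifying JobParameters, while the beans themselves are always rebuilt
from whatever configuration I provide. So after a crash I would recreate
the beans and simply launch again with the same parameters (parameter name
and value below are placeholders):

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

// Same job name + same identifying parameters = same JobInstance.
JobParameters params = new JobParametersBuilder()
        .addString("queryConfig", "queries.conf") // placeholder parameter
        .toJobParameters();

// First attempt; may fail mid-step.
JobExecution firstRun = jobLauncher.run(job, params);

// After rebuilding the beans programmatically, launching again with the
// same parameters should resume the failed JobInstance from the last
// committed chunk (jobLauncher.run throws checked batch exceptions).
JobExecution secondRun = jobLauncher.run(job, params);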
If I really do need XML, is there a way to serialize a Spring Batch
JobRepository (or one of these objects) as a Spring Batch XML config?
Right now, I have tried to configure my Step with the following code, but
I am unsure whether this is the proper way to do it:
1. Is TaskletStep the way to go?
2. Is the way I create the chunked reader/writer correct, or is there some
other object I should use instead?
3. I would have assumed that opening the reader and writer would happen
automatically as part of the JobExecution, but if I don't open these
resources before running the Job, I get an exception telling me I need to
open them first. Maybe I need some other object that manages the resources
(the JDBC connection and the file handle)? See the sketch after the code.
JdbcCursorItemReader<Foobar> itemReader = new JdbcCursorItemReader<Foobar>();
itemReader.setSql(sqlStr);
itemReader.setDataSource(dataSource);
itemReader.setRowMapper(rowMapper);
itemReader.afterPropertiesSet();

ExecutionContext executionContext = new ExecutionContext();
itemReader.open(executionContext);

FlatFileItemWriter<String> itemWriter = new FlatFileItemWriter<String>();
itemWriter.setLineAggregator(new PassThroughLineAggregator<String>());
itemWriter.setResource(outResource);
itemWriter.afterPropertiesSet();
itemWriter.open(executionContext);

// Commit every 50000 items, so a restart can resume at the last chunk.
int commitInterval = 50000;
CompletionPolicy completionPolicy = new SimpleCompletionPolicy(commitInterval);
RepeatTemplate repeatTemplate = new RepeatTemplate();
repeatTemplate.setCompletionPolicy(completionPolicy);
RepeatOperations repeatOperations = repeatTemplate;

ChunkProvider<Foobar> chunkProvider =
        new SimpleChunkProvider<Foobar>(itemReader, repeatOperations);
ItemProcessor<Foobar, String> itemProcessor = new ItemProcessor<Foobar, String>() {
    public String process(Foobar item) {
        return item.toString(); // custom implementation: format one CSV line
    }
};
ChunkProcessor<Foobar> chunkProcessor =
        new SimpleChunkProcessor<Foobar, String>(itemProcessor, itemWriter);
Tasklet tasklet = new ChunkOrientedTasklet<Foobar>(chunkProvider, chunkProcessor);

TaskletStep taskletStep = new TaskletStep();
taskletStep.setName(taskletName);
taskletStep.setJobRepository(jobRepository);
taskletStep.setTransactionManager(transactionManager);
taskletStep.setTasklet(tasklet);
taskletStep.afterPropertiesSet();
job.addStep(taskletStep);
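Regarding the manual open() calls: if I read the API correctly,
registering the reader and writer as streams on the step should make the
step open them with its own ExecutionContext and close them when it
finishes, so I would not have to open them myself. An untested sketch,
reusing itemReader, itemWriter and taskletStep from above:

import org.springframework.batch.item.ItemStream;

// Let the step manage the JDBC cursor and the file handle: the step
// opens the streams with its ExecutionContext and closes them at the end.
taskletStep.setStreams(new ItemStream[] { itemReader, itemWriter });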