Benchmarking Excel Generation In Ruby And How To Avoid Background Jobs Memory Explosion!
Frantisek
My name is Frantisek and I am a committed engineer @PerAngusta. This story is about how to overcome huge memory usage by Ruby Excel spreadsheet exporters when dealing with massive data.
The Memory Problem
Until then, we were using the amazing gem called caxlsx for generating Excel files from our database. It is a very customizable library that helps you create nice and clean Excels spreadsheets from a data collection.
However the memory consumption of the gem was disastrous for our Sidekiq worker. Whenever you tried to export more than 10K records, the memory consumptions skyrocketed to 200% of the workersβ capacity, making other jobs to slow down or even fail.
The origin of the problem was that caxlsx was keeping all data and file in memory, before saving it to disk.
Even if these situations happened rarely enough to handle it manually, we had to take action.
Benchmarking some interesting gems
To solve our technical problem, we though about IO streaming. I dug deep in articles and documentations to find two gems that suited our needs: fast_excel and xlsxtream.
Implementing both of them was pretty much easy, the big question though was which one is the most efficient? It was now time to make some benchmarking. π
You can find at the end of the article the gist I used to benchmark the 3 gems I had. The results speak for themselves. What I was the most astonished is the difference gap between the gem we actually use and the two others.
For 1,000 records to export:
Memory benchmark β β β β β β β β β β β β β β β β β β -
fast_excel: 553256 allocated
xlsxtream: 3184054 allocated β 5.76x more
caxlsx: 20307803 allocated β 36.71x moreTime benchmark β β β β β β β β β β β β β β β β β β -
fast_excel: 0.0513 seconds
xlsxtream: 0.0641 seconds - 1.25x more
caxlsx: 0.7928 seconds - 15.45x more
For 100K records:
Memory benchmark β β β β β β β β β β β β β β β β β β -
fast_excel: 55201256 allocated
xlsxtream: 311132199 allocated β 5.64x more
caxlsx: 1941847581 allocated β 35.18x moreTime benchmark β β β β β β β β β β β β β β β β β β -
fast_excel: 5.2363 seconds
xlsxtream: 7.9638 seconds - 1.52x more
caxlsx: 77.9891 seconds - 14.89x more
Benchmarks β , but what about functionalities?
When you decide to switch from a library to an other, the main question remains always the functionality one. In this case my question was:
Is one of these two gem replacements functional enough for our end need?
I have made some comparisons of those 3 gems and here they are in a form of a table.
As you can see, I was pretty much confident to move from caxlsx to fast_excel. The only thing that was unable to reproduce was hyperlinks with different content than the link itself. However, you can make links in Excel with formulas, so I had a workaround.
πΎ Results after implementation
After these pretty successful benchmarks, it was time to implement fast_excel and here is how the Sidekiq worker looked like. I have set the Sidekiq concurrency to 5 and tried to launch 6 exports. As you can seen I have first tested my solution, that worked fine. Then I rolled back to the current state and tried the same. This time I have not been able to export all of them. Then I deployed the feature again and tried to push the limits harder. Everything was still flat.
π on top, when deployed in production, we managed to export 1.5million of records in a single excel with dozens of columns SUCCESSFULLY!! π₯ The file weighted 1Go but the export process had no problem.
Benchmark Github Gist
Here you can find the whole benchmark script I have performed.
# frozen_string_literal: true
require 'time'
# Excel generating gems
require 'xlsxtream'
require 'caxlsx'
require 'fast_excel'
# benchmarking gems
require 'benchmark/memory'
require 'benchmark'
@count = 1_000
@times = []
# CLEANUP - remove files that could remain in current directory
`rm xlsxtream_test_file.xlsx` if File.file?('xlsxtream_test_file.xlsx')
`rm axlsx_test_file.xlsx` if File.file?('axlsx_test_file.xlsx')
`rm fast_excel_test_file.xlsx` if File.file?('fast_excel_test_file.xlsx')
def xlsxtream
starting = Process.clock_gettime(Process::CLOCK_MONOTONIC)
# Creates a new workbook and closes it at the end of the block
Xlsxtream::Workbook.open('xlsxtream_test_file.xlsx') do |xlsx|
xlsx.write_worksheet 'test' do |sheet|
@count.times do |i|
sheet << [true, Date.today + i, 'hello', 'world', rand(1000000), i, 42**13]
end
end
end
# time consuming operation
ending = Process.clock_gettime(Process::CLOCK_MONOTONIC)
elapsed = ending - starting
@times << ['xlsxtream', elapsed]
end
def axlsx
starting = Process.clock_gettime(Process::CLOCK_MONOTONIC)
axlsx_package = Axlsx::Package.new
sheet = axlsx_package.workbook.add_worksheet(name: 'test2')
@count.times do |i|
sheet.add_row [true, Date.today + i, 'hello', 'world', rand(1000000), i, 42**13]
end
axlsx_package.serialize('axlsx_test_file.xlsx')
ending = Process.clock_gettime(Process::CLOCK_MONOTONIC)
elapsed = ending - starting
@times << ['axlsx', elapsed]
end
def fast_excel
starting = Process.clock_gettime(Process::CLOCK_MONOTONIC)
workbook = FastExcel.open('fast_excel_test_file.xlsx', constant_memory: true)
worksheet = workbook.add_worksheet('test3')
@count.times do |i|
worksheet.append_row([true, Date.today + i, 'hello', 'world', rand(1000000), i, 42**13])
end
workbook.close
ending = Process.clock_gettime(Process::CLOCK_MONOTONIC)
elapsed = ending - starting
@times << ['fast_excel', elapsed]
end
puts "Benchmarks for generating #{@acount} lines Excel"
puts 'Memory benchmark -------------------------------------'
Benchmark.memory do |x|
x.report('xlsxtream') { xlsxtream }
x.report('axlsx') { axlsx }
x.report('fast_excel') { fast_excel }
x.compare!
end
puts 'Time benchmark -------------------------------------'
puts 'Comparison:'
output = []
sorted = @times.sort_by { |t| t[1] }
sorted.each_with_index do |time, i|
more = i == 0 ? '' : format(format('%.2fx more', (time[1] / sorted[0][1])))
output << format('%s - %20s: %10.4f seconds %s', i + 1, time[0], time[1], more)
end
puts output