Importing huge files with Ruby using Enumerators
When you’re working with large files, you need to be careful with memory usage. Ruby’s File.read is a convenient way to read a file, but it loads the whole thing into memory as a single String. This is not ideal when you’re working with files that are larger than your available memory, or when you’re in a memory-constrained environment or in a hurry.
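Here is a minimal sketch (not from the original post, file contents invented) contrasting the two approaches: File.read returns the whole file as one String, while File.foreach streams it one line at a time, keeping only the current line in memory.

```ruby
require "tempfile"

# Build a tiny throwaway file to read from
tmp = Tempfile.new("demo")
tmp.puts('{"id": "1", "name": "Card 1"}')
tmp.puts('{"id": "2", "name": "Card 2"}')
tmp.flush

whole = File.read(tmp.path)               # entire file in memory at once
line_count = File.foreach(tmp.path).count # streams line by line, never slurps

tmp.close!
```

For a two-line file the difference is invisible, but for a multi-gigabyte file only the second approach stays within a constant memory footprint.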
The task at hand
I’m building a Magic: The Gathering collection manager using Ruby on Rails. Easy enough, right? I have a file with all the cards in the game, and I need to import them into my database. The file is a gigantic JSON file with over 450,000 objects. I won’t even attempt to read it all into memory at once.
Warming up: splitting the big file
The file I’m working with is called all-cards-YYYY-mm-dd.json, and it’s a JSON file with an array of card objects. It includes the entire card database for the game, updated every time a new set is released. This is provided by the Scryfall API. I decided to download the file once so I don’t bomb their API with hundreds or thousands of requests, and instead update it as needed.
# all-cards-2024-08-10.json
[
{"id": "1", "name": "Card 1", "lang": "en", set: "KHM", ...},
{"id": "2", "name": "Card 2", "lang": "es", set: "BLB", ...},
...
{"id": "450000", "name": "Card 450000", "lang": "en"}
]
The first thing I did was split the file into smaller files. I created a rake task which reads the big file and writes smaller files, one per card language. MtG cards are printed in a variety of languages, from English and Spanish to Hindi, Japanese, Russian, French and Italian. For my use case I only need English and Spanish, but I decided to keep the rest just in case.
require "json"
require "set"

languages = Set.new
lang_files = Set.new

# all_cards_file is the bulk file downloaded from Scryfall,
# e.g. "all-cards-2024-08-10.json"
File.foreach("scryfall_data/#{all_cards_file}") do |line|
  # Skip the first and last lines, since they're the array brackets
  next if line.match?(/^\[|\]$/)

  clean_line = line.strip
  # Remove the trailing comma so each line is valid JSON on its own
  clean_line = clean_line[0..-2] if clean_line[-1] == ","

  json = JSON.parse(clean_line)
  current_language = json["lang"]
  current_lang_file = "scryfall_data/all-cards-#{current_language}.json"

  unless languages.include?(current_language)
    languages << current_language
    lang_files << current_lang_file
  end

  # Append the line to the current language file
  File.open(current_lang_file, "a") do |f|
    f.puts clean_line
  end
end
The end result here is a bunch of all-cards-<language>.json files, each with one JSON object per line. I thought I was so clever by storing plain JSON objects in a file without the need to parse the whole thing, and it turns out this is an already existing idea called JSON Lines. Oh well.
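The nice property of the JSON Lines layout is that every line parses on its own. A small sketch (file contents invented for illustration) of reading such a file back:

```ruby
require "json"
require "tempfile"

# Simulate one of the per-language output files: one JSON object per line
jsonl = Tempfile.new("all-cards-en")
jsonl.puts('{"id": "1", "name": "Card 1", "lang": "en"}')
jsonl.puts('{"id": "2", "name": "Card 2", "lang": "en"}')
jsonl.flush

# Each line is a complete JSON document, so no full-file parse is needed
cards = File.foreach(jsonl.path).map { |line| JSON.parse(line) }

jsonl.close!
```

This is exactly what makes the streaming import in the next section possible: the parser never needs to see more than one line at a time.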
The main event: importing the cards
The first iteration was quite simple. I read the file line by line and parsed each line as JSON. This worked fine for the smaller files, and since File.foreach streams the file one line at a time, memory wasn’t really the problem anymore; the problem was issuing one database INSERT per card. I needed a way to read the file in chunks and insert each chunk in a single batch.
File.foreach("scryfall_data/all-cards-en.json").with_index do |line, _line_num|
json = JSON.parse(line)
# Process the card
# :raw_data is a jsonb column
Card.create(raw_data: json)
end
This is simple, intuitive and easy to understand, but it’s not efficient when going over ~140,000 cards in the English file alone. The whole process was taking about 20 minutes to complete, and while this is not a daily occurrence, I wanted to make it faster.
Enter Enumerators
Ruby’s Enumerator class is a powerful tool that allows you to work with collections in a lazy way. You can think of an Enumerator as a collection of items that you can iterate over, but it doesn’t store the items in memory: it generates them on the fly.
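A tiny sketch of what “generates them on the fly” means: this Enumerator describes an infinite sequence, yet nothing is computed until a consumer asks for values.

```ruby
# An Enumerator over all natural numbers; the block only runs on demand
naturals = Enumerator.new do |yielder|
  n = 0
  loop do
    yielder << n
    n += 1
  end
end

first_five = naturals.first(5) # only five values are ever produced
```

File.foreach works the same way when called without a block: it hands back an Enumerator over the lines, and lines are only read from disk as they are consumed.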
By using an Enumerator, I can read the file in chunks and process each chunk separately. This way, I can process the file without reading the entire thing, reducing both the memory usage and the time it takes to import the cards.
# File.foreach without a block returns an Enumerator over the lines
file_object = File.foreach("scryfall_data/all-cards-en.json")

file_object.each_slice(300) do |lines|
  cards = lines.map { |line| { raw_data: JSON.parse(line) } }
  Card.insert_all(cards)
end
This code reads the file in chunks of 300 lines and processes each chunk separately. This way, I can import the cards without reading the whole thing into memory. The insert_all method is a Rails method that inserts multiple records in a single query, which is much faster than inserting each record individually. Keep in mind that insert_all skips Active Record validations and callbacks, which is acceptable here since the raw data is stored as-is.
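The batching mechanism itself is easy to see in isolation. each_slice pulls lazily from its source and yields arrays of at most n elements; here it is on a plain range instead of a file (a standalone sketch, no Rails involved):

```ruby
# each_slice(3) groups a stream into arrays of up to 3 elements,
# the same way the import groups 300 lines per insert_all call
batches = (1..10).each_slice(3).to_a
```

The final partial batch is yielded as-is, so no line at the end of the file is ever dropped.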
This change reduced the time it takes to import the cards from 20 minutes to less than 2 minutes. It’s a huge improvement, and it’s all thanks to Enumerators being awesome.
Conclusion
When working with large files, it’s important to be mindful of memory usage. By using Enumerators, you can read files in chunks and process them separately, reducing memory usage and improving performance.