Importing huge files with Ruby using Enumerators
When you’re working with large files, you need to be careful with memory usage. Ruby’s File.read is a convenient way to read a file, but it loads the whole thing into memory as a single String. This is not ideal when you’re working with files that are larger than your available memory, or when you’re in a memory-constrained environment or in a hurry.
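Here is a minimal sketch (not from the original post, file contents invented) contrasting the two approaches: File.read returns the whole file as one String, while File.foreach streams it one line at a time, keeping only the current line in memory.

```ruby
require "tempfile"

# Build a tiny throwaway file to read from
tmp = Tempfile.new("demo")
tmp.puts('{"id": "1", "name": "Card 1"}')
tmp.puts('{"id": "2", "name": "Card 2"}')
tmp.flush

whole = File.read(tmp.path)               # entire file in memory at once
line_count = File.foreach(tmp.path).count # streams line by line, never slurps

tmp.close!
```

For a two-line file the difference is invisible, but for a multi-gigabyte file only the second approach stays within a constant memory footprint.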
The task at hand
I’m building a Magic: The Gathering collection manager using Ruby on Rails. Easy enough, right? I have a file with all the cards in the game, and I need to import them into my database. The file is a gigantic JSON file with over 450,000 objects. I won’t even attempt to read it all into memory at once.
Warming up: splitting the big file
The file I’m working with is called all-cards-YYYY-mm-dd.json, and it’s a JSON file with an array of card objects. It includes the entire card database for the game, updated every time a new set is released. This is provided by the Scryfall API. I decided to download the file once so I don’t bomb their API with hundreds or thousands of requests, and instead update it as needed.
# all-cards-2024-08-10.json
[
{"id": "1", "name": "Card 1", "lang": "en", set: "KHM", ...},
{"id": "2", "name": "Card 2", "lang": "es", set: "BLB", ...},
...
{"id": "450000", "name": "Card 450000", "lang": "en"}
]
The first thing I did was split the file into smaller files. I created a rake task which reads the big file and writes smaller files, one per card language. MtG cards are printed in a variety of languages, from English and Spanish to Hindi, Japanese, Russian, French and Italian. For my use case I only need English and Spanish, but I decided to keep the rest just in case.
require "json"
require "set"

languages = Set.new
lang_files = Set.new

# all_cards_file is the bulk file downloaded from Scryfall,
# e.g. "all-cards-2024-08-10.json"
File.foreach("scryfall_data/#{all_cards_file}") do |line|
  # Skip the first and last lines, since they're the array brackets
  next if line.match?(/^\[|\]$/)

  clean_line = line.strip
  # Remove the trailing comma so each line is valid JSON on its own
  clean_line = clean_line[0..-2] if clean_line[-1] == ","

  json = JSON.parse(clean_line)
  current_language = json["lang"]
  current_lang_file = "scryfall_data/all-cards-#{current_language}.json"

  unless languages.include?(current_language)
    languages << current_language
    lang_files << current_lang_file
  end

  # Append the line to the current language file
  File.open(current_lang_file, "a") do |f|
    f.puts clean_line
  end
end
The end result here is a bunch of all-cards-<language>.json files, each with one JSON object per line. I thought I was so clever by storing plain JSON objects in a file without the need to parse the whole thing, and it turns out this is an already existing idea called JSON Lines. Oh well.
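The nice property of the JSON Lines layout is that every line parses on its own. A small sketch (file contents invented for illustration) of reading such a file back:

```ruby
require "json"
require "tempfile"

# Simulate one of the per-language output files: one JSON object per line
jsonl = Tempfile.new("all-cards-en")
jsonl.puts('{"id": "1", "name": "Card 1", "lang": "en"}')
jsonl.puts('{"id": "2", "name": "Card 2", "lang": "en"}')
jsonl.flush

# Each line is a complete JSON document, so no full-file parse is needed
cards = File.foreach(jsonl.path).map { |line| JSON.parse(line) }

jsonl.close!
```

This is exactly what makes the streaming import in the next section possible: the parser never needs to see more than one line at a time.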
The main event: importing the cards
The first iteration was quite simple. I read the file line by line and parsed each line as JSON. This worked fine for the smaller files, and since File.foreach streams the file one line at a time, memory wasn’t really the problem anymore; the problem was issuing one database INSERT per card. I needed a way to read the file in chunks and insert each chunk in a single batch.
File.foreach("scryfall_data/all-cards-en.json").with_index do |line, _line_num|
json = JSON.parse(line)
# Process the card
# :raw_data is a jsonb column
Card.create(raw_data: json)
end
This is simple, intuitive and easy to understand, but it’s not efficient when going over ~140,000 cards in the English file alone. The whole process was taking about 20 minutes to complete, and while this is not a daily occurrence, I wanted to make it faster.
Enter Enumerators
Ruby’s Enumerator class is a powerful tool that allows you to work with collections in a lazy way. You can think of an Enumerator as a collection of items that you can iterate over, but it doesn’t store the items in memory: it generates them on the fly.
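A tiny sketch of what “generates them on the fly” means: this Enumerator describes an infinite sequence, yet nothing is computed until a consumer asks for values.

```ruby
# An Enumerator over all natural numbers; the block only runs on demand
naturals = Enumerator.new do |yielder|
  n = 0
  loop do
    yielder << n
    n += 1
  end
end

first_five = naturals.first(5) # only five values are ever produced
```

File.foreach works the same way when called without a block: it hands back an Enumerator over the lines, and lines are only read from disk as they are consumed.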
By using an Enumerator, I can read the file in chunks and process each chunk separately. This way, I can process the file without reading the entire thing, reducing both the memory usage and the time it takes to import the cards.
# File.foreach without a block returns an Enumerator over the lines
file_object = File.foreach("scryfall_data/all-cards-en.json")

file_object.each_slice(300) do |lines|
  cards = lines.map { |line| { raw_data: JSON.parse(line) } }
  Card.insert_all(cards)
end
This code reads the file in chunks of 300 lines and processes each chunk separately. This way, I can import the cards without reading the whole thing into memory. The insert_all method is a Rails method that inserts multiple records in a single query, which is much faster than inserting each record individually. Keep in mind that insert_all skips Active Record validations and callbacks, which is acceptable here since the raw data is stored as-is.
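The batching mechanism itself is easy to see in isolation. each_slice pulls lazily from its source and yields arrays of at most n elements; here it is on a plain range instead of a file (a standalone sketch, no Rails involved):

```ruby
# each_slice(3) groups a stream into arrays of up to 3 elements,
# the same way the import groups 300 lines per insert_all call
batches = (1..10).each_slice(3).to_a
```

The final partial batch is yielded as-is, so no line at the end of the file is ever dropped.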
This change reduced the time it takes to import the cards from 20 minutes to less than 2 minutes. It’s a huge improvement, and it’s all thanks to Enumerators being awesome.
Conclusion
When working with large files, it’s important to be mindful of memory usage. By using Enumerators, you can read files in chunks and process them separately, reducing memory usage and improving performance.