Problem: I need to copy a large block of data from a remote location into system memory as quickly as possible.
Scenario: I have a data processing system. The system is built on the fly via shell scripts, using multiple components that are pulled in from remote locations.
One of those components is a large block of data stored as groups of files.
The requirement I have is to retrieve that large block of data from a remote location and install it into system memory as quickly as possible, so that the system which relies on this data can start using it for processing as soon after boot time as possible.
Question: "What would be the most efficient container for my data?"
Solutions already tried/considered:
- ISO file: requires tools for creation and reading that are not typically native
- TAR file: extracting can take a lot of time
- Remote filesystem mounted as local: slow because contents need to be copied into memory
- LVM snapshot: geared more toward backups, not built for speed on restore
Notes:
- Data loss is not a primary concern.
- The remote file transfer procedure is not a primary concern as I already have an adequate tool.
- The system is currently using Ubuntu Linux.
"The remote file transfer procedure is not a primary concern as I already have an adequate tool."
If you already have the file transferred, I suggest using mmap(2).
You should consider an image file containing a file system that holds your data (put a loop device over the file with losetup and mount the loop device). The fastest option would probably be a compressed read-only file system such as squashfs.

This would even allow some tricks if not all of the data is needed simultaneously: instead of mounting the loop device directly, you could put a DM device on top of it, mount a network file system (or network block device) that exposes the image file, put a second loop device on top of the network version of the file, and combine both loop devices with the DM device.
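A minimal sketch of the basic setup, assuming the data lives under /srv/data, the image is named data.sqsh, and the mount point /mnt/data already exists (all of these names are assumptions):

```
# Build a compressed, read-only squashfs image from the data directory
mksquashfs /srv/data data.sqsh

# Attach the image to a free loop device and mount it; nothing is extracted
LOOPDEV=$(losetup --find --show data.sqsh)
mount -t squashfs -o ro "$LOOPDEV" /mnt/data
```

Because squashfs is compressed, the image is also smaller to transfer, and reads are decompressed on demand rather than requiring an up-front extraction step.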
Let's assume you have to copy 500 MiB of data. You start copying it. As soon as the first 100 MiB have been transferred, you create the loop devices and the DM device. The DM device points to the loop device of the local file for the first 100 MiB and to the other one for the rest. After each transferred 10 MiB block (for example), you suspend the DM device and reload it with the boundary shifted by another 10 MiB.
The risk: if accesses go to the network version, then that data is transferred twice. So if that happens often, the data transfer will take longer (although the whole process may still finish earlier, depending on its access characteristics).
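A rough sketch of how the DM side of that could look, assuming the local copy sits behind /dev/loop0, the network-backed copy behind /dev/loop1, a 500 MiB image (1024000 sectors of 512 bytes), and an initial boundary at 100 MiB (204800 sectors); the device name and all numbers are assumptions:

```
# Linear DM device: first 100 MiB from the local copy, the rest from the network copy
dmsetup create databridge <<'EOF'
0 204800 linear /dev/loop0 0
204800 819200 linear /dev/loop1 204800
EOF

# Once another 10 MiB has arrived locally, shift the boundary to 110 MiB (225280 sectors)
dmsetup suspend databridge
dmsetup load databridge <<'EOF'
0 225280 linear /dev/loop0 0
225280 798720 linear /dev/loop1 225280
EOF
dmsetup resume databridge
```

The file system is then mounted on top of the DM device instead of directly on the local loop device.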
Edit 1:
See this answer of mine to another question for an explanation of how to use DM devices this way (without the suspend/reload/resume, though).
My initial research into the ISO container was apparently incomplete. The ISO container seems to be the most efficient for the purpose of being able to quickly get to the contents. This is based on what my research has been able to uncover, and could of course change.
Packaged in an ISO, I am able to create the container easily, transfer it quickly, and mount it almost instantly. Using this container I have been able to get the entire process down to under 1 minute, which is an acceptable tolerance level for this project.
Creating this container is done easily in Ubuntu with a command similar to the following:
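A minimal sketch, assuming the data files live under a directory named data/ and the image is written to data.iso (both names are assumptions; -R and -J add Rock Ridge and Joliet extensions so file names survive intact):

```
# Build an ISO 9660 image containing the data directory
genisoimage -R -J -o data.iso data/
```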
**Note that this requires `genisoimage`, which is easily installed via `apt-get`.**

To store the file directly into memory, I created a ramdisk in the `/tmp` filesystem:
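A minimal sketch, assuming a 1 GiB tmpfs mounted at /tmp/ramdisk (the size and mount point are assumptions):

```
# Create a mount point and mount a RAM-backed tmpfs on it
mkdir -p /tmp/ramdisk
mount -t tmpfs -o size=1024m tmpfs /tmp/ramdisk
```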
Retrieving the container can be done relatively quickly with a multipart transfer utility. I used one called `axel` in this manner:
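A minimal sketch, assuming 10 parallel connections and a hypothetical URL for the image:

```
# Download the ISO straight into the ramdisk using multiple connections
axel -n 10 -o /tmp/ramdisk/data.iso http://example.com/data.iso
```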
Finally, we mount the file to a local filesystem:
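A minimal sketch, assuming the mount point /mnt/data (an assumption) already exists:

```
# Loop-mount the ISO read-only; nothing is extracted or copied
mount -o loop,ro /tmp/ramdisk/data.iso /mnt/data
```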
The mounting process is nearly instantaneous, which allows the system to quickly begin using the data for processing.