
Memory concerns using the Database Toolbox in MATLAB on 64-bit Linux

Until recently my experience with large matrices of data within MATLAB had been limited to data that I preprocessed into CSV files via other programming languages and imported into MATLAB. I've been working on an application for one of my graduate courses that clusters ~30,000 individuals based on ~60,480,000 timeseries measurements stored in a MySQL database. It seemed best to use the Database Toolbox to query the database directly.

After developing my application against small samples of the data using 1000 individuals, I began running into memory issues when I queried for the entire data set. For what it's worth, the text that follows is largely an exercise in "RTFM", but I'll describe my process in case it helps others.

Choosing Data Types of Results

MathWorks has an article titled Strategies for Efficient Use of Memory, which was a helpful starting point. For me, the section titled Using Appropriate Data Storage was particularly helpful. My initial query was for three columns in MySQL of types INT, FLOAT, and DATETIME. The data was returned in MATLAB as a cell array with two double columns and one string column. For 2,016,000 rows (1000 individuals x 2016 measurements), this came to 457.6MB of data.

First, because the hourly timeseries measurements were over the same 2016 hours for each individual, I could break the query for timestamps into its own operation. 2016 rows of timestamp strings is 200.8kB.

Second, the remaining two columns of data are numeric. To control the data types returned, I set the DataReturnFormat property with setdbprefs.

setdbprefs('DataReturnFormat', 'numeric');

When setting the cursor data to a matrix of results, single precision was accurate enough for the numeric data I was working with.

results = single(cursor.Data);

The matrix of single precision numeric results is 15.4MB. However, the interim cursor object for this same query was 30.8MB, since it holds the same values as doubles. When querying for ~30,000 individuals the cursor would still be quite large, so the data should be processed in blocks.
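The two sizes follow from the element widths alone and can be checked without a database. The following is a minimal sketch, assuming the two numeric columns arrive as a 2,016,000-by-2 matrix; the variable names are illustrative and a zero matrix stands in for the real cursor data.

```matlab
% Stand-in for cursor.Data: 1000 individuals x 2016 measurements, 2 columns.
nRows = 1000 * 2016;
asDouble = zeros(nRows, 2);      % double precision: 8 bytes per element
asSingle = single(asDouble);     % single precision: 4 bytes per element

infoDouble = whos('asDouble');
infoSingle = whos('asSingle');

% Report sizes in MB (2^20 bytes).
fprintf('double: %.1f MB\n', infoDouble.bytes / 2^20);
fprintf('single: %.1f MB\n', infoSingle.bytes / 2^20);
```

Halving the element width halves the footprint, which is why the converted matrix is half the size of the cursor data it came from.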

Process by Blocks and Clear Temporary Variables

Following the recommendation of the Process Data By Blocks section, I created a while loop and placed both the query and the appending of cursor data to the results matrix within it. On each iteration, data for a batch of individuals was queried, so the cursor's memory footprint remained constant. Only the results matrix grew as data was appended to it.
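A loop of this shape might look like the sketch below. It is not the original application code: it assumes an open Database Toolbox connection named conn, and the table name measurements, the column names individual_id and value, the ID range, and the batch size are all hypothetical placeholders. It also assumes every batch returns at least one row.

```matlab
% Assumes conn is an already-open Database Toolbox connection.
setdbprefs('DataReturnFormat', 'numeric');

batchSize = 1000;        % individuals per query (illustrative value)
lastId    = 30000;       % hypothetical highest individual id
firstId   = 1;
results   = single([]);  % grows as each batch is appended

while firstId <= lastId
    upperId = min(firstId + batchSize - 1, lastId);

    % Table and column names here are hypothetical placeholders.
    sql = sprintf(['SELECT individual_id, value FROM measurements ' ...
                   'WHERE individual_id BETWEEN %d AND %d'], ...
                  firstId, upperId);

    curs = exec(conn, sql);
    curs = fetch(curs);

    % Convert to single before appending so only the small copy is kept.
    results = [results; single(curs.Data)]; %#ok<AGROW>

    % Release the temporary cursor before the next batch is fetched.
    close(curs);
    clear curs;

    firstId = upperId + 1;
end
```

Closing and clearing the cursor inside the loop is what keeps the temporary data from accumulating alongside the results matrix.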

Cursor Batch Size

I was also able to use automated batching pursuant to the prior Process Data By Blocks recommendation. At the start of my query, near the aforementioned DataReturnFormat database preference, I added the following two lines.

setdbprefs('FetchInBatches','yes');
setdbprefs('FetchBatchSize','100');

I used automated batching when fetching results from the cursor, as described in Method 2.

curs = exec(conn, '...');
curs = fetch(curs);

Changing the JVM Used by MATLAB

Though I don't think it mattered which JVM I used, it is possible to change the JVM that MATLAB uses. On Linux this is done by setting the $MATLAB_JAVA environment variable to point to the root directory of the JVM you wish to use. In Ubuntu, I set my default JVM to be the proprietary Oracle JVM, whose java executable is located at /usr/lib/jvm/jdk1.7.0/bin/java.

I set $MATLAB_JAVA as a permanent environment variable by adding it to ~/.pam_environment. The contents of my ~/.pam_environment file look like:

MATLAB_JAVA=/usr/lib/jvm/jdk1.7.0/jre
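To confirm that the setting was picked up, the JVM in use can be inspected from within MATLAB after logging out and back in. This is a small sketch using the built-in getenv and version functions; the output depends on the local installation.

```matlab
% Environment variable as seen by the MATLAB process (empty if not set).
javaRoot = getenv('MATLAB_JAVA');
disp(javaRoot);

% Version string of the JVM that MATLAB actually loaded.
jvmVersion = version('-java');
disp(jvmVersion);
```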

Increasing Heap Beyond GUI Recommendations

Despite having 8GB of RAM and running 64-bit MATLAB in 64-bit Linux, the Preferences area within the GUI only allowed me to give MATLAB 4GB of heap. It is possible to give MATLAB additional heap beyond the recommended settings by editing ~/.matlab/R2013a/matlab.prf. Find the JavaMemHeapMax line and set it to the number of megabytes you would like, keeping the leading I in front of the number. For example, to give MATLAB 7GB of heap the line would read:

JavaMemHeapMax=I7168
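After restarting MATLAB, the heap limit the JVM actually received can be read back through the standard Java Runtime class. This is a minimal sketch; the reported value may come out somewhat below the configured number, depending on the JVM.

```matlab
% Ask the running JVM for its maximum heap size and convert bytes to MB.
maxHeapMB = java.lang.Runtime.getRuntime.maxMemory / 2^20;
fprintf('JVM max heap: %.0f MB\n', maxHeapMB);
```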