Loading 120 Million Records into Excel
Recently, I read Christopher Woodill's article, How Much Data Can PowerPivot Really Manage? How about 122 Million Records?! I had never worked with such a big dataset in Excel, so I decided to mirror his work and create my own gigantic Excel workbook. This article demonstrates the step-by-step process of loading 120 million records into Excel.
The data comes originally from RITA; it covers every commercial flight in the United States from 1987 to 2008. There are 122 million records in 22 files, totalling over 11 GB of data. I downloaded all of them into a single folder.
Using Power Query
I will use Power Query as a self-service Extract, Transform and Load (ETL) tool. It is a free add-in that you can download from the Microsoft site; once installed, it appears inside Microsoft Excel.
Since all the data files are in the same folder, we can load them using From Folder.
I specified the folder location and clicked OK.
The Query Editor window launched, showing 22 rows, one for each data file. I only needed the first column, Content, which holds each file's binary data.
I right-clicked the Content column header and selected Remove Other Columns.
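For reference, the Query Editor records each of these actions as a step in the M language. A minimal sketch of the query at this point might look like the following; the folder path is a hypothetical placeholder, and yours will differ:

    let
        // Assumption: the 22 data files sit in this hypothetical folder
        Source = Folder.Files("C:\Data\AirlineOnTime"),
        // Keep only the Content column, which holds each file's binary payload
        #"Removed Other Columns" = Table.SelectColumns(Source, {"Content"})
    in
        #"Removed Other Columns"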
Once the other columns were removed, I clicked the combine icon in the Content column header.
This combined and displayed the contents of all the data files. At this point, you may notice that the Query Editor has recorded every step we have performed so far.
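Behind the combine icon, Power Query generates steps along these lines. This is a sketch that slots into the let expression above; delimiter and encoding options are omitted for brevity:

    // Concatenate all 22 binaries into one stream and parse it as CSV
    ImportedCSV = Csv.Document(Binary.Combine(#"Removed Other Columns"[Content])),
    // Promote the first row (the header row of the first file) to column names;
    // stray header rows from the remaining files can be filtered out later
    #"Promoted Headers" = Table.PromoteHeaders(ImportedCSV)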
I removed the following columns:
DepTime
ArrTime
FlightNum
TailNum
ActualElapsedTime
CRSElapsedTime
AirTime
Distance
TaxiIn
TaxiOut
Diverted
WeatherDelay
NASDelay
SecurityDelay
LateAircraftDelay
I highlighted these columns and right-clicked a column header to select Remove Columns; the generated step is sketched below.
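In the formula bar, the removal step looks roughly like this, assuming the previous step kept its auto-generated name, "Promoted Headers":

    = Table.RemoveColumns(#"Promoted Headers", {"DepTime", "ArrTime", "FlightNum",
        "TailNum", "ActualElapsedTime", "CRSElapsedTime", "AirTime", "Distance",
        "TaxiIn", "TaxiOut", "Diverted", "WeatherDelay", "NASDelay",
        "SecurityDelay", "LateAircraftDelay"})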
I found that some rows had NA in either the ArrDelay or DepDelay column, and I excluded these rows with a filter.
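The filter step might read as follows, assuming the previous step carries the auto-generated name "Removed Columns":

    // Keep only rows where both delay columns hold real values
    = Table.SelectRows(#"Removed Columns", each [ArrDelay] <> "NA" and [DepDelay] <> "NA")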
I noticed that both ArrDelay and DepDelay were typed as Text, so I changed them to Whole Number so that I could analyze them numerically.
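That type change corresponds to a step along these lines (again, the step name "Filtered Rows" is the assumed auto-generated one):

    // Convert both delay columns from Text to Whole Number
    = Table.TransformColumnTypes(#"Filtered Rows",
        {{"ArrDelay", Int64.Type}, {"DepDelay", Int64.Type}})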
Those were all the transformations I required, so the query was ready to Close & Load.
Once I clicked Load, Power Query started loading the data into Power Pivot (the Data Model).
120 million records were successfully loaded into Excel within 30 minutes.
I saved the Excel workbook and noticed the file size was about 1 GB.
Let's create a PivotTable and a Slicer in Excel. We can see that there are 120,947,440 records in the data model.
The user experience was more than satisfactory: the PivotTable refreshed within one to two seconds after I selected a new set of destinations with the slicer.
Conclusion
My machine is equipped with an Intel i7-2600K and 16 GB of memory, i.e. not the latest and greatest. I could slice and dice the PivotTable across various dimensions; a refresh sometimes took a few seconds, but never more than 10. I am pretty impressed with the performance.
One hundred twenty million records is a lot of data; we would probably not accumulate that many records even if we imported seven years of financial results for many smaller businesses.
I encourage you to go through the same exercise and experience what Power Query and Power Pivot bring to Excel. We are in the 21st century, and they are part of modern Excel.
Such granularity allows for better and faster analysis! No more high-level guesstimates!