Pandas with Rows (2022)

Summary

The problem

We want to find the top 5 American airports with the largest average (mean) delay on domestic flights.

Data

We will be using the Data Expo 2009: Airline on time data dataset from the Harvard Dataverse. The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is around 120 million records, split into 22 CSV files, one per year, plus 4 auxiliary CSV files that we will not use here. The total size of the dataset on disk is around 13 GB. The original data comes compressed, but decompression is not considered part of the pipeline here.

Environment

The hardware available for the job is a single computer with the following specs:

Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
Memory: LPDDR3 15820512 kB (16 GB) 2133 MT/s, no swap
Disk: KXG50ZNV512G NVMe TOSHIBA 512GB (ext4 non-encrypted partition)
OS: Linux 5.19.9 (Arch distribution, only KDE Plasma and a single Konsole session running, using 610 MB of RAM)

The versions of the software used in the post are:

Naive approach

A first try at solving the problem could involve using pandas and loading the data with the following code:

    import pandas

    df = pandas.concat((pandas.read_csv(f'{year}.csv') for year in range(1987, 2009)))

Unfortunately, this is likely to raise a MemoryError (or restart the kernel if you are using Jupyter) unless your system has a huge amount of RAM. The rest of this article describes different ways to avoid this error, and how to make your code faster and more efficient with simple options.

Pure Python

In this particular case, we do not really need to load all the data into memory to get the average delay for each airport. We can discard the rows as we read them, keeping track only of the cumulative delay and the number of rows read for each airport. Once all rows have been processed, we can compute each mean by dividing the total delay by the number of flights.
This could be an imple...
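The running-totals idea described above can be sketched roughly as follows. This is not the article's own implementation (that part of the text is cut off), only a minimal illustration under stated assumptions: the function name `top_mean_delays` is hypothetical, the airport is taken from the `Origin` column and the delay from `DepDelay`, rows whose delay is missing (`NA`) are skipped, and the files are read as latin-1 so that stray non-UTF-8 bytes do not abort the run.

```python
import csv
import heapq
from collections import defaultdict


def top_mean_delays(paths, airport_col="Origin", delay_col="DepDelay", n=5):
    """Return the n airports with the largest mean delay, reading row by row.

    Assumption: airport_col and delay_col name the columns to use; the
    defaults are a guess at what "delay per airport" means here.
    """
    total = defaultdict(float)  # airport -> cumulative delay
    count = defaultdict(int)    # airport -> number of rows with a usable delay

    for path in paths:
        with open(path, newline="", encoding="latin-1") as f:
            for row in csv.DictReader(f):
                try:
                    airport = row[airport_col]
                    delay = float(row[delay_col])
                except (KeyError, TypeError, ValueError):
                    # Missing column, short row, or non-numeric delay such as "NA".
                    continue
                total[airport] += delay
                count[airport] += 1

    # Only two small dictionaries are ever held in memory, never the rows.
    means = {airport: total[airport] / count[airport] for airport in count}
    return heapq.nlargest(n, means.items(), key=lambda item: item[1])
```

With the yearly files in the working directory, a call such as `top_mean_delays(f"{year}.csv" for year in range(1987, 2009))` would return a list of `(airport, mean_delay)` pairs. Memory use stays roughly constant regardless of how many rows are read, at the cost of parsing every field in pure Python.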
