Introduction:

CSV (Comma-Separated Values) files are a popular format for storing and exchanging tabular data. In Python, there are two main approaches to handling CSV files: using the built-in csv module and using the pandas library. In this article, we will compare and contrast these two approaches, highlighting their differences in terms of readability, efficiency, and flexibility. We will present two programs, one optimized for Python 2 using the csv module and another optimized for Python 3 using pandas, and discuss their strengths and weaknesses.

Program 1 (Python 2, using the csv module):

import os
import shutil
import csv

IN_PATH = "/home/mep/cdr/truteq_smsc/in"
OUT_PATH = "/home/mep/cdr/truteq_smsc/out"
BACKUP_PATH = "/home/mep/cdr/truteq_smsc/backup"

new_files = os.listdir(IN_PATH)
processed_files = os.listdir(OUT_PATH)

cdr_files = [file for file in new_files if file.startswith("cdr")]

for file in cdr_files:
    try:
        # The csv module on Python 2 expects files opened in binary mode.
        with open(os.path.join(IN_PATH, file), 'rb') as infile:
            # Python 2 has no f-strings, so the output name is built by concatenation.
            with open(os.path.join(OUT_PATH, file[:-4] + "_rf_txt"), 'wb') as outfile:
                reader = csv.reader(infile)
                writer = csv.writer(outfile)
                for row in reader:
                    # Extract the SMS columns of interest, then insert the
                    # status placeholder (2) at index 3.
                    processed_row = [row[i] for i in [5, 1, 1, 2, 3, 9, 16, 18, 19]]
                    processed_row.insert(3, 2)
                    if row[4] == 'Deliver' and row[5] == 'Success':
                        processed_row[3] = 1
                    # Drop the 'Type' and 'Status' columns, then append the
                    # 'Message Type' column (always 1).
                    del processed_row[4:6]
                    processed_row.append(1)
                    # Pad the row with empty strings at the positions the
                    # target format expects.
                    empty_string_positions = [0, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 16, 17, 18, 20, 21, 23, 24]
                    for pos in empty_string_positions:
                        processed_row.insert(pos, '')
                    writer.writerow(processed_row)

        shutil.move(os.path.join(IN_PATH, file), BACKUP_PATH)
        print("File " + file + " processed successfully.")
    except Exception as e:
        print("Error processing file " + file + ": " + str(e))

Program 2 (Python 3, using pandas):

import os
import shutil
import pandas as pd

IN_PATH = r"C:\test\in"
OUT_PATH = r"C:\test\out"
BACKUP_PATH = r"C:\test\backup"

new_files = os.listdir(IN_PATH)
processed_files = os.listdir(OUT_PATH)

cdr_files = [file for file in new_files if file.startswith("cdr")]

for file in cdr_files:
    try:
        df = pd.read_csv(os.path.join(IN_PATH, file), header=None)
        # Extract the SMS columns of interest and relabel them positionally
        # so that later label-based operations are unambiguous.
        df = df.iloc[:, [5, 1, 1, 2, 3, 9, 16, 18, 19]]
        df.columns = range(df.shape[1])
        # Insert the SMS status column at index 3, initially set to 2.
        df.insert(3, 'sms_status', 2)
        df.loc[(df.iloc[:, 4] == 'Deliver') & (df.iloc[:, 5] == 'Success'), 'sms_status'] = 1
        # Drop the 'Type' and 'Status' columns, then append 'Message Type' (always 1).
        df = df.drop(df.columns[[4, 5]], axis=1)
        df.insert(len(df.columns), 'message_type', 1)
        # Pad with empty-string columns at the positions the target format
        # expects; allow_duplicates permits the repeated blank label.
        empty_string_positions = [0, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 16, 17, 18, 20, 21, 23, 24]
        for pos in empty_string_positions:
            df.insert(pos, None, '', allow_duplicates=True)
        # header=False keeps the output consistent with the csv-module version.
        df.to_csv(os.path.join(OUT_PATH, f"{file[:-4]}_rf_txt"), index=False, header=False)
        shutil.move(os.path.join(IN_PATH, file), BACKUP_PATH)
        print(f"File '{file}' processed successfully.")
    except Exception as e:
        print(f"Error processing file '{file}': {e}")

Both programs are designed to process CSV call detail record (CDR) files produced by a live SMS system. They perform the following steps:

1. List all files in a specified input directory (IN_PATH) that start with "cdr".

2. For each file, read its contents and extract specific columns (5, 1, 1, 2, 3, 9, 16, 18, 19) containing the SMS data.

3. Modify the data to conform to a specific format:
- Insert a new column at index 3 for the SMS status, initially set to 2.
- If the row's type is 'Deliver' and its status is 'Success', set the SMS status to 1.
- Delete the columns for 'Type' and 'Status'.
- Add a new column for 'Message Type' with a value of 1.
- Insert empty-string columns at specified positions in the data.

4. Write the modified data to a new CSV file in a specified output directory (OUT_PATH), named after the original file (minus its extension) with _rf_txt appended.

5. Move the processed file to a backup directory (BACKUP_PATH).
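
Because the two programs are meant to implement the same transformation, a quick sanity check is to run both over the same input and compare the resulting files byte for byte. A minimal sketch using the standard library (the output paths here are hypothetical):

import filecmp

# Hypothetical output locations for the same input file processed by each program.
py2_output = "/tmp/out_csv/cdr_20240101_rf_txt"
pandas_output = "/tmp/out_pandas/cdr_20240101_rf_txt"

# shallow=False forces a byte-by-byte comparison rather than a stat() check.
if filecmp.cmp(py2_output, pandas_output, shallow=False):
    print("Both programs produced identical output.")
else:
    print("Outputs differ: check quoting, headers, or line endings.")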

Comparison and Contrast:

The first program, optimized for Python 2, uses the csv module for CSV processing. It reads files row by row, manually processing each row to extract and manipulate data. While this approach is straightforward, it can be less efficient for large datasets due to the manual iteration and processing.

On the other hand, the second program, optimized for Python 3, utilizes the pandas library for CSV processing. Pandas provides a powerful and efficient way to handle tabular data, especially for large datasets. It uses vectorized operations, making it faster and more efficient compared to the manual row processing in the first program.
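
To make that difference concrete, here is a sketch of the status-update step written both ways. The column names ('type', 'status', 'sms_status') and the sample rows are hypothetical; the vectorized form uses numpy.where to evaluate the condition across all rows in a single call:

import pandas as pd
import numpy as np

# Hypothetical frame holding the two columns the delivery condition inspects.
df = pd.DataFrame({
    'type': ['Deliver', 'Submit', 'Deliver'],
    'status': ['Success', 'Success', 'Failed'],
})

# Row-by-row, csv-module style: one Python-level iteration per record.
statuses = []
for _, row in df.iterrows():
    statuses.append(1 if row['type'] == 'Deliver' and row['status'] == 'Success' else 2)

# Vectorized, pandas style: the whole column is computed in one call.
df['sms_status'] = np.where((df['type'] == 'Deliver') & (df['status'] == 'Success'), 1, 2)

print(statuses)                    # [1, 2, 2]
print(df['sms_status'].tolist())   # [1, 2, 2]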

Let's do a more detailed analysis:

Readability: Program 1 uses nested with statements and explicit iteration over rows and columns, making it more verbose but easier to understand for developers familiar with basic Python file operations and CSV processing. Program 2 condenses the code with pandas, which might be less readable for those unfamiliar with the library; on the other hand, pandas provides more concise ways to handle data frames and column operations.

Efficiency: Program 1 reads and processes CSV files row by row, which can be less efficient for large files due to repeated file I/O and manual row processing. Program 2 utilizes pandas, which is optimized for handling large datasets and provides vectorized operations, potentially leading to better performance, especially for larger files.

Maintainability: Program 1 requires explicit handling of file operations and CSV parsing, making it more error-prone and harder to maintain when the processing logic needs to change. Program 2's pandas code is more compact, and operations on data frames are more intuitive, which can make maintenance and modification easier.

Flexibility: Program 1 provides more control over the processing logic, allowing custom handling of each row and column, which can be beneficial for complex processing requirements. Program 2 is slightly less flexible in terms of low-level control, but pandas offers a wide range of built-in functions and methods for common data manipulation tasks, making it suitable for many data processing needs.

Error Handling: Both programs wrap processing in a try/except block, but Program 2 benefits from pandas' built-in error reporting, which raises specific, descriptive exceptions (see the sketch after this list).

Portability: Program 1 uses only the standard library, making it portable across systems without additional dependencies. Program 2 requires pandas, which adds a dependency but provides more advanced data manipulation capabilities.
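
As an example of the error-handling difference: where the csv module signals parse problems with the generic csv.Error, pandas raises specific exceptions such as pandas.errors.EmptyDataError and pandas.errors.ParserError that identify what went wrong. A minimal sketch (the file name is hypothetical):

import pandas as pd

try:
    df = pd.read_csv("cdr_broken.csv", header=None)
except pd.errors.EmptyDataError:
    print("File is empty: nothing to process.")
except pd.errors.ParserError as e:
    print("Malformed CSV: " + str(e))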

In summary, Program 1 is more explicit and might be easier to understand for developers not familiar with pandas or data frame operations. It offers finer control over processing logic but may be less efficient and more error-prone for larger datasets. Program 2, on the other hand, leverages pandas for more concise and potentially more efficient data processing, especially for larger datasets, at the cost of a slightly steeper learning curve for those new to pandas.

Conclusion:

In conclusion, both the csv module and pandas offer ways to handle CSV files in Python, each with its own strengths and weaknesses. The csv module is simpler and more straightforward, suitable for small to medium-sized datasets and when compatibility with Python 2 is required. Pandas, on the other hand, is more efficient and powerful, making it ideal for handling large datasets and when working with Python 3. The choice between the two depends on the specific requirements of your project, balancing simplicity and efficiency.
