This blog post is about a Python Program that can quickly consolidate data from all Excel workbooks in computer folder. Regardless of whether you have 100, 500, or even thousands of Excel files stored in a folder, this program can easily extract the data and present it in a single Worksheet.
To use this program, you’ll need to have Python installed on your computer, along with the openpyxl library.
Table of Contents
How to install Python?
Different versions of Python is available in the following website.
https://www.python.org/downloads/
While writing this post, I am using Python 3.11.2. However, you can always go with the latest version.
Once you’ve downloaded and run the setup file, you can check whether Python is installed on your computer by searching for it in the Windows search bar. Look for Python IDLE (IDLE stands for Integrated Development Environment for Python), and if you see it, then you can be sure that Python is installed and ready to use on your computer.
How to install openpyxl?
openpyxl is a third party Python ‘library’ that allows us to Read and Edit Excel Workbooks. Once you have installed Python on your computer, activate the ‘Command Prompt’ of your computer and execute the following command.
pip install openpyxl
After installing openpyxl, type exit and press the Enter key to close the Command Prompt.
Data extraction Challenge
In the C drive of my computer, under a folder named ‘Score Card’, there are 139 Excel workbooks that contain the scorecards of Civil Engineering students at an Engineering college.
Each workbook follows the same format, with the candidate’s Name in cell C2, Roll Number in cell C3, and Scores in cells D7 to D12.
I want to consolidate the Name, Roll Number and Scores of these 139 Candidates into a single worksheet.
If done manually, the process will be like,
Open the first Workbook containing data > Copy the required data > Paste the copied into the Master Sheet > Close the first Workbook. As the data is in non contiguous cells, copy pasting data from each worksheet will take at least 30 seconds.
Given that there are 139 workbooks, this would result in a total of approximately 1 hour of work (139 x 30 = 4170 Seconds = 1.15 Hours).
By using a Python program, we can automate this task and complete it in just a few seconds. In the next section, we’ll learn how to write this program in Python.
Python code for Reading Excel files
To get started, launch Python IDLE.
Python IDLE has two parts.
- Interactive Shell – Allows us to execute Python code interactively and see the results immediately, making it useful for quick experimentation and testing of small code snippets.
- File Editor – Provides a user-friendly interface for writing, editing, and executing Python code, making it easier to develop and maintain larger programs and scripts.
We will be using the File Editor of Python IDLE for writing and executing this program.
To open the File Editor, go to the File tab > New File
To save the file, go to the File tab, click on Save.
Using the Save As dialog, we can specify the Name of the file and Location to save it.
Here, I will name the Python file as ‘Read from Excel Files’ and will save it in the desktop of the computer.
Once you have saved the file you can type the code into the File Editor.
Following is the Python code which will extract data from all Excel files in a computer folder and paste it into a single Excel Worksheet. You can either type in or copy this code into the file editor and execute it.
import os
import openpyxl
folder = 'C:\Score Card'
output_file = 'C:\Score Card\Candidate Scores.xlsx'
output_wb = openpyxl.Workbook()
output_sheet = output_wb.active
output_sheet.title = 'Candidate Data'
cells = ['C2', 'C3', 'D7', 'D8', 'D9', 'D10', 'D11', 'D12']
for filename in os.listdir(folder):
if filename.endswith('.xlsx'):
file = os.path.join(folder, filename)
workbook = openpyxl.load_workbook(file)
values = [workbook.active[cell].value for cell in cells]
output_sheet.append(values)
output_wb.save(output_file)
Explanation of the Code
Here’s a detailed explanation of the code for those who are interested in understanding the program.
import os
os is a python module that will enable us to interact with the Operating System of the computer.
import openpyxl
openpyxl is a third party Python library that will enable us to Read and Write Excel files.
folder = 'C:\Score Card'
output_file = 'C:\Score Card\Candidate Scores.xlsx'
These lines set the folder path where the input Excel files are stored in the ‘folder‘ variable and the output file path where the final output Excel file will be saved in the ‘output_file‘ variable.
output_wb = openpyxl.Workbook()
output_sheet = output_wb.active
output_sheet.title = 'Candidate Data'
These lines create a new Excel workbook using the ‘openpyxl‘ library, and sets the active sheet to a new sheet titled ‘Candidate Data‘.
cells = ['C2', 'C3', 'D7', 'D8', 'D9', 'D10', 'D11', 'D12']
Address of the cells from which data is to be extracted are stored in the variable called ‘cells‘.
for filename in os.listdir(folder):
if filename.endswith('.xlsx'):
file = os.path.join(folder, filename)
workbook = openpyxl.load_workbook(file)
values = [workbook.active[cell].value for cell in cells]
output_sheet.append(values)
These lines loop through each file in the ‘folder‘ directory and extract the specified data from the specified cells using the ‘openpyxl‘ library. For each file, the ‘if‘ statement checks whether the file has the extension ‘.xlsx‘. If so, it loads the workbook using ‘openpyxl.load_workbook()‘, and extracts the values of the specified cells using a list comprehension. It then appends the extracted values to the active sheet of the ‘output_wb‘ workbook.
output_wb.save(output_file)
This line saves the final output workbook to the specified file path.
Executing and modifying the Program
For executing the program, click on Run > Run module
Within a few seconds, a new Excel Workbook called ‘Candidate Scores‘ will be created in the folder specified in the program.
The names, roll numbers, and scores of all 139 candidates have been extracted into this workbook.
The program written here is for extracting scores, but the same program can be used for any kind of data extraction requirements, by modifying the following variables.
- folder – Use the folder path of the directory that contains your Excel Workbooks
- output_file – Specify the name and the location of the Excel Workbook where extracted data will be copied
- cells – Address of the cells from which data is to be copied