Tutorial: Extracting Data from Complex Text Files

Objective:

Learn how to extract specific pieces of information from a text file with a complex structure using Python.

Overview:

In this tutorial, we’ll walk through a real-life scenario: extracting the amortized times of the setup, online, and end-to-end phases from a multi-line log file.

1. Reading the File

To begin, we’ll read the contents of our target file.

# Reading the file
with open("path_to_file.txt", "r") as file:
    lines = file.readlines()
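
As a variation on the step above, the sketch below reads the file with an explicit encoding and strips the trailing newline from each line. The helper name read_log_lines, the UTF-8 default, and the temporary demo file are assumptions made for illustration; they are not part of the original script.

```python
import tempfile
from pathlib import Path

def read_log_lines(path, encoding="utf-8"):
    """Return the file's lines without trailing newline characters."""
    # errors="replace" keeps reading if the log contains undecodable bytes.
    text = Path(path).read_text(encoding=encoding, errors="replace")
    return text.splitlines()

# Demonstration on a temporary file with hypothetical content.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_log.txt"
    demo_path.write_text("first line\nsecond line\n", encoding="utf-8")
    demo_lines = read_log_lines(demo_path)

print(demo_lines)  # ['first line', 'second line']
```

Like readlines(), this loads the whole file into memory, which is usually fine for log files of modest size.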

2. Identifying Patterns & Using Regular Expressions

Identifying patterns is key to data extraction. Using Python’s re module, we can match and extract specific patterns from the text.
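
To make the idea concrete, here is a minimal sketch that applies one pattern to a single line. The sample line is an assumption modeled on the patterns used in this tutorial, not a line taken from a real log, and the second capture group for the total time is added only to show how groups work.

```python
import re

# Hypothetical log line; the real file's exact wording may differ.
sample_line = "Setup Phase took 1520 ms, amortized time 1.52 ms"

# Group 1 captures the total time, group 2 the amortized time.
pattern = re.compile(r"Setup Phase took (\d+) ms, amortized time (\d+\.\d+) ms")

match = pattern.search(sample_line)
if match is not None:
    total_ms = int(match.group(1))
    amortized_ms = float(match.group(2))
    print(total_ms, amortized_ms)  # 1520 1.52
```

re.search returns None when the line does not fit the pattern, which is why the result is checked before calling group().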

a. Extracting Setup Phase Time

import re

# Extracting the amortized time values for "Setup Phase"
setup_times = [float(re.search(r"Setup Phase took \d+ ms, amortized time (\d+\.\d+) ms", line).group(1))
               for line in lines if "Setup Phase took" in line]
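
The one-line form above raises AttributeError if a line contains "Setup Phase took" but does not match the full pattern, for example when the amortized time is logged without a decimal point. A more defensive sketch follows; the helper name extract_amortized and the demo lines are hypothetical, and accepting whole-number values is an assumption about how the log might vary.

```python
import re

def extract_amortized(lines, phase):
    """Return the amortized times (ms) logged for the given phase name."""
    # re.escape guards against regex metacharacters in the phase name.
    # (\d+(?:\.\d+)?) accepts both "3" and "3.25".
    pattern = re.compile(
        re.escape(phase) + r" took \d+ ms, amortized time (\d+(?:\.\d+)?) ms"
    )
    times = []
    for line in lines:
        match = pattern.search(line)
        if match:  # skip lines that do not fit the expected format
            times.append(float(match.group(1)))
    return times

# Hypothetical lines for illustration only.
demo_lines = [
    "Setup Phase took 1520 ms, amortized time 1.52 ms\n",
    "Setup Phase took 3000 ms, amortized time 3 ms\n",
    "Setup Phase took a while\n",
]
print(extract_amortized(demo_lines, "Setup Phase"))  # [1.52, 3.0]
```

The same helper can be called with "Online Phase" to cover the next step.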

b. Extracting Online Phase Time

# Extracting the amortized time values for "Online Phase"
online_phase_times = [float(re.search(r"Online Phase took \d+ ms, amortized time (\d+\.\d+) ms", line).group(1))
                      for line in lines if "Online Phase took" in line]

c. Extracting End-to-End Time

# Extracting the "End to end amortized time" values
# The pattern stops at "ms" so the match does not depend on what follows on the line
end_to_end_times = [float(re.search(r"End to end amortized time (\d+\.\d+) ms", line).group(1))
                    for line in lines if "End to end amortized time" in line]
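
The three extractions each scan the full list of lines. As an alternative, the sketch below collects all three metrics in a single pass. The dictionary keys, the function name extract_all, and the demo lines are assumptions for illustration, and the End to end pattern here ends at "ms", which assumes nothing specific about the text that follows.

```python
import re
from collections import defaultdict

# One pattern per metric; the keys are labels chosen for this sketch.
PATTERNS = {
    "setup": re.compile(r"Setup Phase took \d+ ms, amortized time (\d+\.\d+) ms"),
    "online": re.compile(r"Online Phase took \d+ ms, amortized time (\d+\.\d+) ms"),
    "end_to_end": re.compile(r"End to end amortized time (\d+\.\d+) ms"),
}

def extract_all(lines):
    """Scan the lines once and collect every metric's amortized times."""
    results = defaultdict(list)
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                results[name].append(float(match.group(1)))
    return dict(results)

# Hypothetical log excerpt for illustration only.
demo_lines = [
    "Setup Phase took 1520 ms, amortized time 1.52 ms",
    "Online Phase took 480 ms, amortized time 0.48 ms",
    "End to end amortized time 2.00 ms",
]
demo_results = extract_all(demo_lines)
print(demo_results)
```

Because the function only iterates over its argument, an open file object can be passed in place of the list, which avoids loading a large log into memory.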

3. Pairing Extracted Data

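After extraction, you may want to pair each value with the run configuration it belongs to. In our example, each extracted value is paired with a (DBEntrySize, N) tuple from the orders list, which must follow the same sequence as the runs in the log. Note that zip stops at the shortest input, so a missing log line silently drops the trailing pairs: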

orders = [
    # ... List of orders ...
]

paired_values_setup = list(zip(orders, setup_times))
paired_values_online = list(zip(orders, online_phase_times))
paired_values_end_to_end = list(zip(orders, end_to_end_times))
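
Since the pairing relies on position alone, a length check before zipping can catch a log with missing or extra lines. A minimal sketch, assuming a hypothetical helper name pair_with_orders and made-up demo values:

```python
def pair_with_orders(orders, values, label):
    """Pair each (DBEntrySize, N) tuple with its value, refusing mismatched lengths."""
    if len(orders) != len(values):
        raise ValueError(
            f"{label}: expected {len(orders)} values, found {len(values)}"
        )
    return list(zip(orders, values))

# Hypothetical orders and times for illustration only.
demo_orders = [(32, 1024), (32, 4096)]
demo_setup_times = [1.52, 3.25]

demo_pairs = pair_with_orders(demo_orders, demo_setup_times, "setup")
print(demo_pairs)  # [((32, 1024), 1.52), ((32, 4096), 3.25)]
```

On Python 3.10 and later, zip(orders, values, strict=True) performs a similar check and raises ValueError when the lengths differ.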

4. Saving Data to a New File

Finally, you can save the paired data to a new file for future reference:

# Saving the results to a file
with open("output_path.txt", "w") as file:
    for (db_size, n), setup_time, online_time, end_to_end_time in zip(orders, setup_times, online_phase_times, end_to_end_times):
        file.write(f"DBEntrySize: {db_size}, N: {n} -> "
                   f"Setup Phase Amortized Time: {setup_time} ms, "
                   f"Online Phase Amortized Time: {online_time} ms, "
                   f"End to End Amortized Time: {end_to_end_time} ms\n")
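
If the results are meant for a spreadsheet or further analysis, CSV may be a more convenient output than free-form text. The sketch below uses the standard csv module; the function name write_results_csv, the column names, and the demo values are assumptions for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

def write_results_csv(out, orders, setup_times, online_times, end_to_end_times):
    """Write one CSV row per (DBEntrySize, N) pair to a text file object."""
    writer = csv.writer(out)
    writer.writerow(["DBEntrySize", "N", "setup_ms", "online_ms", "end_to_end_ms"])
    rows = zip(orders, setup_times, online_times, end_to_end_times)
    for (db_size, n), setup, online, end_to_end in rows:
        writer.writerow([db_size, n, setup, online, end_to_end])

# Hypothetical values; an in-memory buffer stands in for a real file.
buffer = io.StringIO()
write_results_csv(buffer, [(32, 1024)], [1.52], [0.48], [2.0])
print(buffer.getvalue())
```

When writing to a real file, open it with newline="" as the csv module documentation recommends, for example open("output_path.csv", "w", newline="").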

Conclusion:

With Python’s re module, extracting specific values from a structured log takes only a few lines: read the file, match each line against a pattern, pair the results with their run configuration, and write them out. The same steps apply to other logs once the patterns are adapted.

Complete Code for Data Extraction:

import re

# 1. Reading the file
with open("path_to_file.txt", "r") as file:
    lines = file.readlines()

# 2. Identifying Patterns & Using Regular Expressions

# a. Extracting Setup Phase Time
setup_times = [float(re.search(r"Setup Phase took \d+ ms, amortized time (\d+\.\d+) ms", line).group(1))
               for line in lines if "Setup Phase took" in line]

# b. Extracting Online Phase Time
online_phase_times = [float(re.search(r"Online Phase took \d+ ms, amortized time (\d+\.\d+) ms", line).group(1))
                      for line in lines if "Online Phase took" in line]

# c. Extracting End-to-End Time
# The pattern stops at "ms" so the match does not depend on what follows on the line
end_to_end_times = [float(re.search(r"End to end amortized time (\d+\.\d+) ms", line).group(1))
                    for line in lines if "End to end amortized time" in line]

# 3. Pairing Extracted Data
orders = [
    # ... List of orders ...
]

# 4. Saving Data to a New File
with open("output_path.txt", "w") as file:
    for (db_size, n), setup_time, online_time, end_to_end_time in zip(orders, setup_times, online_phase_times, end_to_end_times):
        file.write(f"DBEntrySize: {db_size}, N: {n} -> "
                   f"Setup Phase Amortized Time: {setup_time} ms, "
                   f"Online Phase Amortized Time: {online_time} ms, "
                   f"End to End Amortized Time: {end_to_end_time} ms\n")

For the complete code to run successfully, replace path_to_file.txt with the path to your input file and fill in the orders list with the appropriate (DBEntrySize, N) values.

After executing the code, the results will be saved in output_path.txt. Adjust the file paths as needed based on your directory structure.

Original article: https://blog.csdn.net/weixin_38396940/article/details/133564241