• Tutorial: Extracting Data from Complex Text Files


    Objective:

    Learn how to extract specific pieces of information from a text file with a complex structure using Python.

    Overview:

    In this tutorial, we’ll walk through a real-life scenario: extracting the amortized times of the setup, online, and end-to-end phases from a multi-line log file.

    Table of Contents:

    1. Reading the File
    2. Identifying Patterns & Using Regular Expressions
    3. Pairing Extracted Data
    4. Saving Data to a New File

    1. Reading the File

    To begin, we’ll read the contents of our target file.

    # Reading the file
    with open("path_to_file.txt", "r") as file:
        lines = file.readlines()
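
    If the encoding of the log is uncertain, a slightly more defensive read may help. The following is a minimal sketch, not part of the original workflow: `read_log_lines` is a hypothetical helper name, and UTF-8 is an assumed encoding. Unlike `readlines()`, it drops the trailing newline from each line, which does not affect the regular expressions used later.

```python
from pathlib import Path


def read_log_lines(path, encoding="utf-8"):
    """Return the lines of a text file without trailing newlines.

    Hypothetical helper; UTF-8 is an assumption about the log file.
    """
    # errors="replace" substitutes undecodable bytes instead of raising
    text = Path(path).read_text(encoding=encoding, errors="replace")
    return text.splitlines()
```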
    

    2. Identifying Patterns & Using Regular Expressions

    Identifying patterns is key to data extraction. Using Python’s re module, we can match and extract specific patterns from the text.
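
    As a small illustration of how a capturing group works, consider the sketch below. The sample line is hypothetical; it is only shaped like the log lines this tutorial's patterns expect.

```python
import re

# Hypothetical log line, shaped like the patterns used in this tutorial
sample = "Setup Phase took 1200 ms, amortized time 1.25 ms"

# \d+ matches one or more digits; the parentheses capture the decimal value
match = re.search(r"amortized time (\d+\.\d+) ms", sample)

# re.search returns None when nothing matches, so check before using the result
amortized = float(match.group(1)) if match else None
print(amortized)
```

    Checking for `None` matters because calling `.group(1)` on a failed match raises an `AttributeError`.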

    a. Extracting Setup Phase Time

    import re
    
    # Extracting the amortized time values for "Setup Phase"
    setup_times = [float(re.search(r"Setup Phase took \d+ ms, amortized time (\d+\.\d+) ms", line).group(1))
                   for line in lines if "Setup Phase took" in line]
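
    The list comprehension above raises an `AttributeError` if a line contains "Setup Phase took" but does not match the full pattern, for instance when the amortized time is printed as a whole number. One possible way to tolerate such lines is sketched below; `extract_times`, `SETUP_RE`, and the sample lines are illustrative names and data, not part of the original code.

```python
import re

# Optional fractional part, so both "3 ms" and "3.5 ms" are accepted
SETUP_RE = re.compile(r"Setup Phase took \d+ ms, amortized time (\d+(?:\.\d+)?) ms")


def extract_times(lines, pattern):
    """Collect the first captured number from every line the pattern matches."""
    times = []
    for line in lines:
        match = pattern.search(line)
        if match:  # lines without a match are skipped instead of raising
            times.append(float(match.group(1)))
    return times


# Hypothetical sample lines for illustration only
demo_lines = [
    "Setup Phase took 1200 ms, amortized time 1.25 ms",
    "Setup Phase took 900 ms, amortized time 3 ms",
    "Setup Phase took a while",
]
demo_setup_times = extract_times(demo_lines, SETUP_RE)
```

    Skipping silently is a design choice: if every matching line is expected to carry a value, logging or counting the skipped lines may be preferable.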
    

    b. Extracting Online Phase Time

    # Extracting the amortized time values for "Online Phase"
    online_phase_times = [float(re.search(r"Online Phase took \d+ ms, amortized time (\d+\.\d+) ms", line).group(1))
                          for line in lines if "Online Phase took" in line]
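
    Since the setup and online patterns differ only in the phase name, both can be collected in a single pass over the lines. The sketch below is one way to do that; `PHASE_PATTERNS`, `extract_phases`, and the sample lines are hypothetical.

```python
import re

# One compiled pattern per phase; the keys are illustrative labels
PHASE_PATTERNS = {
    "setup": re.compile(r"Setup Phase took \d+ ms, amortized time (\d+\.\d+) ms"),
    "online": re.compile(r"Online Phase took \d+ ms, amortized time (\d+\.\d+) ms"),
}


def extract_phases(lines):
    """Return {"setup": [...], "online": [...]} from a single pass over lines."""
    results = {name: [] for name in PHASE_PATTERNS}
    for line in lines:
        for name, pattern in PHASE_PATTERNS.items():
            match = pattern.search(line)
            if match:
                results[name].append(float(match.group(1)))
    return results


# Hypothetical sample lines for illustration only
phase_demo = extract_phases([
    "Setup Phase took 1200 ms, amortized time 1.25 ms",
    "Online Phase took 300 ms, amortized time 0.75 ms",
])
```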
    

    c. Extracting End-to-End Time

    # Extracting the "End to end amortized time" values.
    # Matching up to the trailing " ms" is enough; requiring the text after it makes the pattern fragile.
    end_to_end_times = [float(re.search(r"End to end amortized time (\d+\.\d+) ms", line).group(1))
                        for line in lines if "End to end amortized time" in line]
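
    If several records can end up on one line, for example when the log omits a newline between them (an assumption about the log, not something the source states), `re.search` only sees the first one. A minimal sketch using `findall` to collect every value on a line, with a hypothetical sample line:

```python
import re

END_TO_END_RE = re.compile(r"End to end amortized time (\d+\.\d+) ms")

# Hypothetical line in which two records ran together without a newline
joined = "End to end amortized time 2.50 msEnd to end amortized time 3.75 ms"

# findall returns every captured group, not just the first match on the line
values = [float(v) for v in END_TO_END_RE.findall(joined)]
```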
    

    3. Pairing Extracted Data

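    After extraction, you might want to pair the extracted data with other relevant information. In this example, each extracted value is paired with a (DBEntrySize, N) tuple from an orders list. The list must follow the same order in which the runs appear in the log, and note that zip stops silently at the shortest input: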

    orders = [
        # ... List of orders ...
    ]
    
    paired_values_setup = list(zip(orders, setup_times))
    paired_values_online = list(zip(orders, online_phase_times))
    paired_values_end_to_end = list(zip(orders, end_to_end_times))
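
    Because `zip` truncates to the shortest input, a mismatch between the number of orders and the number of extracted values would go unnoticed. One way to guard against that is sketched below; `pair_with_orders` and the sample values are hypothetical.

```python
def pair_with_orders(orders, values, label):
    """Pair orders with values, refusing to continue if the counts differ."""
    if len(orders) != len(values):
        raise ValueError(
            f"{label}: expected {len(orders)} values, found {len(values)}"
        )
    return list(zip(orders, values))


# Hypothetical (DBEntrySize, N) orders and timings for illustration only
demo_orders = [(8, 1024), (32, 4096)]
demo_pairs = pair_with_orders(demo_orders, [1.25, 3.5], "setup")
```

    On Python 3.10 or later, `zip(orders, values, strict=True)` performs a similar check without a helper.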
    

    4. Saving Data to a New File

    Finally, you can save the paired data to a new file for future reference:

    # Saving the results to a file
    with open("output_path.txt", "w") as file:
        for (db_size, n), setup_time, online_time, end_to_end_time in zip(orders, setup_times, online_phase_times, end_to_end_times):
            file.write(f"DBEntrySize: {db_size}, N: {n} -> "
                       f"Setup Phase Amortized Time: {setup_time} ms, "
                       f"Online Phase Amortized Time: {online_time} ms, "
                       f"End to End Amortized Time: {end_to_end_time} ms\n")
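
    If the results will be loaded into a spreadsheet or another script later, CSV may be a more convenient output than free-form text. The following is a sketch of that alternative using the standard `csv` module; the function name and column headers are assumptions chosen for illustration.

```python
import csv


def write_results_csv(path, orders, setup_times, online_times, end_to_end_times):
    """Write one CSV row per (DBEntrySize, N) order."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["db_entry_size", "n", "setup_ms", "online_ms", "end_to_end_ms"])
        rows = zip(orders, setup_times, online_times, end_to_end_times)
        for (db_size, n), setup, online, end_to_end in rows:
            writer.writerow([db_size, n, setup, online, end_to_end])
```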
    

    Conclusion:

    With Python's re module, a few targeted patterns are enough to pull specific values out of a complex log file. The steps above (read the file, match each pattern, pair the values with their parameters, write the results) can serve as a template for similar extraction tasks.

    Complete Code for Data Extraction:

    import re
    
    # 1. Reading the file
    with open("path_to_file.txt", "r") as file:
        lines = file.readlines()
    
    # 2. Identifying Patterns & Using Regular Expressions
    
    # a. Extracting Setup Phase Time
    setup_times = [float(re.search(r"Setup Phase took \d+ ms, amortized time (\d+\.\d+) ms", line).group(1))
                   for line in lines if "Setup Phase took" in line]
    
    # b. Extracting Online Phase Time
    online_phase_times = [float(re.search(r"Online Phase took \d+ ms, amortized time (\d+\.\d+) ms", line).group(1))
                          for line in lines if "Online Phase took" in line]
    
    # c. Extracting End-to-End Time (matching up to the trailing " ms" is enough)
    end_to_end_times = [float(re.search(r"End to end amortized time (\d+\.\d+) ms", line).group(1))
                        for line in lines if "End to end amortized time" in line]
    
    # 3. Pairing Extracted Data
    orders = [
        # ... List of orders ...
    ]
    
    # 4. Saving Data to a New File
    with open("output_path.txt", "w") as file:
        for (db_size, n), setup_time, online_time, end_to_end_time in zip(orders, setup_times, online_phase_times, end_to_end_times):
            file.write(f"DBEntrySize: {db_size}, N: {n} -> "
                       f"Setup Phase Amortized Time: {setup_time} ms, "
                       f"Online Phase Amortized Time: {online_time} ms, "
                       f"End to End Amortized Time: {end_to_end_time} ms\n")
    

    For the complete code to run successfully, replace path_to_file.txt with the path to your input file and fill in the orders list with the appropriate (DBEntrySize, N) values.

    After executing the code, the results will be saved in output_path.txt. Adjust the file paths as needed based on your directory structure.

  • Original article: https://blog.csdn.net/weixin_38396940/article/details/133564241