Remove unnecessary data structure complexity #38

Open
opened 2024-07-16 12:04:47 +02:00 by sebastianw · 8 comments
Collaborator

The data structures reflect the folder layout that was in use when PyLoT was written. Migrating data structures and handling a variety of structures has become much easier lately, so the formerly used classes can be removed or simplified.

Acceptance Criteria:

Data structures are only used for special cases. Unused sub-classes are removed.

Related pull request: #39 (https://git.geophysik.ruhr-uni-bochum.de/marcel/pylot/pulls/39)
sebastianw added the
Priority:0
enhancement
analysis
labels 2024-07-16 12:05:11 +02:00
sebastianw self-assigned this 2024-07-16 12:05:29 +02:00
Author
Collaborator

@marcel FYI

Author
Collaborator

Design proposal:

  • separate data
    • event data
    • waveform data
  • remove structure
    • replace by data handlers for
      • event data
      • waveform data
    • remove usages
    • remove data structure class
Author
Collaborator

For the actual implementation I would need a test data set. Alternatively, @marcel could take over the implementation on this branch for testing purposes. Are there other data structures in use?

This is the first proposal for a change:

from obspy import read, read_events, Stream, Catalog
from dataclasses import dataclass, field
from typing import List
import os
import fnmatch

@dataclass
class SeismicEventData:
    event_id: str = ""
    catalog: Catalog = field(default_factory=Catalog)

    def find_event_files(self, directory: str, extensions: List[str]) -> List[str]:
        """
        Browse the directory to find event files with specified extensions.

        Parameters:
        directory (str): The directory path to search for event files.
        extensions (List[str]): List of file extensions to search for.

        Returns:
        List[str]: List of file paths that match the given extensions.

        Example:
        >>> sed = SeismicEventData()
        >>> sed.find_event_files('test_directory', ['.xml', '.quakeml']) # doctest: +SKIP
        ['test_directory/event1.xml', 'test_directory/event2.quakeml']
        """
        matches = []
        for root, _, files in os.walk(directory):
            for ext in extensions:
                for filename in fnmatch.filter(files, f'*{ext}'):
                    matches.append(os.path.join(root, filename))
        # os.walk yields entries in arbitrary order; sort so "first found" is reproducible.
        return sorted(matches)

    def read_event_from_directory(self, directory: str, extensions: List[str], format: str) -> None:
        """
        Read a seismic event from the first found file in the directory with specified format.

        Parameters:
        directory (str): The directory path to search for event files.
        extensions (List[str]): List of file extensions to search for.
        format (str): The format to read the event file.

        Example:
        >>> sed = SeismicEventData()
        >>> sed.read_event_from_directory('test_directory', ['.xml', '.quakeml'], 'QUAKEML') # doctest: +SKIP
        """
        event_files = self.find_event_files(directory, extensions)
        if event_files:
            self.read_event(event_files[0], format)
        else:
            raise FileNotFoundError(f"No event files found in directory {directory} with extensions {extensions}.")

    def read_event(self, file_path: str, format: str) -> None:
        """
        Read a seismic event from a file with specified format.

        Parameters:
        file_path (str): The path to the event file.
        format (str): The format to read the event file.

        Example:
        >>> sed = SeismicEventData()
        >>> sed.read_event('test_directory/event1.xml', 'QUAKEML') # doctest: +SKIP
        """
        if os.path.exists(file_path):
            self.catalog = read_events(file_path, format=format)
            self.event_id = self.catalog[0].resource_id.id.split('/')[-1] if self.catalog else ""
        else:
            raise FileNotFoundError(f"File {file_path} does not exist.")

    def write_event(self, file_path: str, format: str) -> None:
        """
        Write the seismic event to a file with specified format.

        Parameters:
        file_path (str): The path to the output file.
        format (str): The format to write the event file.

        Example:
        >>> sed = SeismicEventData(event_id='12345')
        >>> sed.write_event('output_directory/event1.xml', 'QUAKEML') # doctest: +SKIP
        """
        self.catalog.write(file_path, format=format)

@dataclass
class WaveformData:
    stream: Stream = field(default_factory=Stream)

    def find_waveform_files(self, directory: str, extensions: List[str]) -> List[str]:
        """
        Browse the directory to find waveform files with specified extensions.

        Parameters:
        directory (str): The directory path to search for waveform files.
        extensions (List[str]): List of file extensions to search for.

        Returns:
        List[str]: List of file paths that match the given extensions.

        Example:
        >>> wd = WaveformData()
        >>> wd.find_waveform_files('test_directory', ['.mseed']) # doctest: +SKIP
        ['test_directory/waveform1.mseed']
        """
        matches = []
        for root, _, files in os.walk(directory):
            for ext in extensions:
                for filename in fnmatch.filter(files, f'*{ext}'):
                    matches.append(os.path.join(root, filename))
        # os.walk yields entries in arbitrary order; sort so "first found" is reproducible.
        return sorted(matches)

    def read_waveform_from_directory(self, directory: str, extensions: List[str], format: str) -> None:
        """
        Read waveform data from the first found file in the directory with specified format.

        Parameters:
        directory (str): The directory path to search for waveform files.
        extensions (List[str]): List of file extensions to search for.
        format (str): The format to read the waveform file.

        Example:
        >>> wd = WaveformData()
        >>> wd.read_waveform_from_directory('test_directory', ['.mseed'], 'MSEED') # doctest: +SKIP
        """
        waveform_files = self.find_waveform_files(directory, extensions)
        if waveform_files:
            self.read_waveform(waveform_files[0], format)
        else:
            raise FileNotFoundError(f"No waveform files found in directory {directory} with extensions {extensions}.")

    def read_waveform(self, file_path: str, format: str) -> None:
        """
        Read waveform data from a file with specified format.

        Parameters:
        file_path (str): The path to the waveform file.
        format (str): The format to read the waveform file.

        Example:
        >>> wd = WaveformData()
        >>> wd.read_waveform('test_directory/waveform1.mseed', 'MSEED') # doctest: +SKIP
        """
        if os.path.exists(file_path):
            self.stream = read(file_path, format=format)
        else:
            raise FileNotFoundError(f"File {file_path} does not exist.")

    def write_waveform(self, file_path: str, format: str) -> None:
        """
        Write the waveform data to a file with specified format.

        Parameters:
        file_path (str): The path to the output file.
        format (str): The format to write the waveform file.

        Example:
        >>> wd = WaveformData()
        >>> wd.write_waveform('output_directory/waveform1.mseed', 'MSEED') # doctest: +SKIP
        """
        self.stream.write(file_path, format=format)

# Example usage:
# seismic_event = SeismicEventData()
# seismic_event.read_event_from_directory("path_to_directory", extensions=[".xml", ".quakeml"], format="QUAKEML")
# seismic_event.write_event("output_event_file.xml", format="QUAKEML")

# waveform_data = WaveformData()
# waveform_data.read_waveform_from_directory("path_to_directory", extensions=[".mseed"], format="MSEED")
# waveform_data.write_waveform("output_waveform_file.mseed", format="MSEED")

Explanation

  • SeismicEventData Class:

    • find_event_files(directory, extensions):
      • Searches the directory for files with specified extensions and returns a list of matches.
      • Parameters:
        • directory (str): The directory path to search for event files.
        • extensions (List[str]): List of file extensions to search for.
      • Returns: List of file paths that match the given extensions.
      • Example usage is provided in the docstring.
    • read_event_from_directory(directory, extensions, format):
      • Uses find_event_files to get the list of event files and reads the first one found, using the specified format.
      • Parameters:
        • directory (str): The directory path to search for event files.
        • extensions (List[str]): List of file extensions to search for.
        • format (str): The format to read the event file.
      • Example usage is provided in the docstring.
    • read_event(file_path, format):
      • Reads an event from a given file path, using the specified format.
      • Parameters:
        • file_path (str): The path to the event file.
        • format (str): The format to read the event file.
      • Example usage is provided in the docstring.
    • write_event(file_path, format):
      • Writes the event to a specified file path, using the specified format.
      • Parameters:
        • file_path (str): The path to the output file.
        • format (str): The format to write the event file.
      • Example usage is provided in the docstring.
  • WaveformData Class:

    • find_waveform_files(directory, extensions):
      • Searches the directory for files with specified extensions and returns a list of matches.
      • Parameters:
        • directory (str): The directory path to search for waveform files.
        • extensions (List[str]): List of file extensions to search for.
      • Returns: List of file paths that match the given extensions.
      • Example usage is provided in the docstring.
    • read_waveform_from_directory(directory, extensions, format):
      • Uses find_waveform_files to get the list of waveform files and reads the first one found, using the specified format.
      • Parameters:
        • directory (str): The directory path to search for waveform files.
        • extensions (List[str]): List of file extensions to search for.
        • format (str): The format to read the waveform file.
      • Example usage is provided in the docstring.
    • read_waveform(file_path, format):
      • Reads waveform data from a given file path, using the specified format.
      • Parameters:
        • file_path (str): The path to the waveform file.
        • format (str): The format to read the waveform file.
      • Example usage is provided in the docstring.
    • write_waveform(file_path, format):
      • Writes the waveform data to a specified file path, using the specified format.
      • Parameters:
        • file_path (str): The path to the output file.
        • format (str): The format to write the waveform file.
      • Example usage is provided in the docstring.
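A possible follow-up simplification: find_event_files and find_waveform_files in the proposal have identical bodies, so both classes could delegate to one shared function. The sketch below is only an illustration; the name find_files is hypothetical and not part of PyLoT or obspy. It also removes duplicates (in case two extensions match the same file) and sorts the result so that "the first file found" does not depend on the order in which os.walk lists a directory.

```python
import fnmatch
import os
from typing import Iterable, List


def find_files(directory: str, extensions: Iterable[str]) -> List[str]:
    """Recursively collect files below `directory` whose names end in one of `extensions`.

    Hypothetical shared helper; both data classes could call it instead of
    carrying their own copy of the search loop.
    """
    patterns = [f"*{ext}" for ext in extensions]
    matches = set()  # a set avoids duplicates when several patterns match one file
    for root, _, files in os.walk(directory):
        for pattern in patterns:
            for filename in fnmatch.filter(files, pattern):
                matches.add(os.path.join(root, filename))
    # Sorting makes the result order reproducible across platforms.
    return sorted(matches)
```

With such a helper, find_event_files and find_waveform_files would each reduce to a single call, for example `return find_files(directory, extensions)`.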
Author
Collaborator

There is no difference between ObspyDMTdataStructure and PilotDataStructure. Why are both implemented?

sebastianw added reference 38-simplify-data-structure 2024-07-16 14:50:20 +02:00
Author
Collaborator

HASH format is now missing in writephases and needs to be re-implemented.
Author
Collaborator

I just saw that obspy.io nowadays supports many of the file formats which are used by PyLoT. Instead of using untested self-written code, PyLoT should exploit the capabilities of obspy.

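One way this could look in practice, as a rough sketch only: let obspy's Catalog.write(filename, format=...) handle every format it supports and keep a small registry for the few formats that still need custom code. The sketch assumes HASH is such a format, as the earlier comment suggests. The names CUSTOM_WRITERS, register_writer, write_hash and write_catalog are hypothetical and do not exist in PyLoT or obspy, and the HASH writer is a placeholder only.

```python
from typing import Callable, Dict

# Hypothetical registry: format name -> writer for formats without an obspy plugin.
CUSTOM_WRITERS: Dict[str, Callable[[object, str], None]] = {}


def register_writer(fmt: str):
    """Decorator that registers a custom writer under an upper-case format name."""
    def decorator(func):
        CUSTOM_WRITERS[fmt.upper()] = func
        return func
    return decorator


@register_writer("HASH")
def write_hash(catalog, file_path: str) -> None:
    # Placeholder: the actual HASH file layout is not reproduced in this sketch.
    raise NotImplementedError("HASH writer is not part of this sketch")


def write_catalog(catalog, file_path: str, fmt: str) -> None:
    """Use a registered custom writer if one exists, otherwise delegate to obspy."""
    writer = CUSTOM_WRITERS.get(fmt.upper())
    if writer is not None:
        writer(catalog, file_path)
    else:
        # `catalog` is expected to be an obspy Catalog (or anything with the same
        # write(filename, format=...) method).
        catalog.write(file_path, format=fmt)
```

The point of the registry is that self-written code stays confined to the formats obspy cannot write, so everything else runs through obspy's tested plugins.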
Author
Collaborator

Removed unnecessary code. I am trying to get an overview of the data handling so that it can be simplified further. These changes might break PyLoT at first, but fixing the breakage should leave it better than before.
Author
Collaborator

> HASH format is now missing in writephases and needs to be re-implemented.

Done with cb457fc7ec (https://git.geophysik.ruhr-uni-bochum.de/marcel/pylot/pulls/39/commits/cb457fc7ec5b5260a0315cd9d472487e0e67e8f1)
Reference: marcel/pylot#38