Advanced Data Flows
StepWright provides built-in File I/O operations directly inside the step flow, eliminating the need to write custom Python file handling logic around your browser automation. Let's explore how readData and writeData let you treat your scraping process like a data pipeline.
Supported Formats
StepWright provides native handlers for:
- JSON (.json)
- CSV (.csv)
- Excel (.xlsx, requires the openpyxl dependency)
- Plain Text (.txt)
- Custom Formats (via Callbacks)
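For comparison, the hand-written file handling that these handlers are meant to replace might look roughly like the sketch below. It is plain standard-library Python, not StepWright code. The helper name load_records is hypothetical, and the choices to treat the first CSV row as a header and to read text files one entry per line are assumptions made for illustration; the shape StepWright itself returns is not specified here.

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Load a .json, .csv, or .txt file into Python data (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".json":
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    if suffix == ".csv":
        # Assumption: the first row is a header, so each row becomes a dict.
        with open(path, newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))
    if suffix == ".txt":
        # Assumption: one entry per line, blank lines skipped.
        with open(path, encoding="utf-8") as fh:
            return [line.strip() for line in fh if line.strip()]
    raise ValueError(f"Unsupported file type: {suffix}")
```

Dispatching on the file extension keeps the calling code identical across formats, which is the same convenience a single data_type parameter gives inside a step.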
readData
Load data into the collector array. This is ideal for feeding URLs or keywords from a file directly into a foreach loop.
BaseStep(
    id="load_file",
    action="readData",
    value="keywords.csv",  # File path
    data_type="csv",       # Expected format
    key="queue"            # Store the result array in the collector under this key
)
foreach (External Lists)
Once data is placed in the collector via readData, you can iterate over it natively with a foreach loop. Use the {{key}} placeholder syntax in the loop's value.
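As a mental model only, the iteration can be pictured as the plain-Python loop below. This is not StepWright's implementation: the helpers render and run_foreach are hypothetical, and the sketch assumes that each entry of the list is substituted into an {{item}} placeholder as a string.

```python
def render(template, item):
    """Replace the {{item}} placeholder with the current loop value (hypothetical helper)."""
    return template.replace("{{item}}", str(item))


def run_foreach(collector, source_key, url_template):
    """Yield one rendered URL per entry stored under source_key."""
    for item in collector[source_key]:
        yield render(url_template, item)


# The collector holds what a readData step stored under the "queue" key.
collector = {"queue": ["laptop", "phone"]}
urls = list(run_foreach(collector, "queue", "https://example.com/search?q={{item}}"))
print(urls)
```

In the step configuration, the same idea is expressed declaratively: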
BaseStep(
    id="loop",
    action="foreach",
    value="{{queue}}",  # Matches the key from readData above
    subSteps=[
        BaseStep(id="nav", action="navigate", value="https://example.com/search?q={{item}}"),
        BaseStep(id="extract", action="data", object=".name", key="name")
    ],
    key="results"
)
writeData
Save your structured output directly back to disk. Use the key to select exactly which entry in the collector gets exported.
BaseStep(
    id="save",
    action="writeData",
    value="results.json",
    data_type="json",
    key="results"  # The key where 'foreach' deposited the data
)
🛠️ Custom Callbacks (Advanced)
Need to parse an XML file or interact with a proprietary API? Provide your own Python logic using a custom callback.
def my_custom_reader(path, step):
    import xml.etree.ElementTree as ET
    tree = ET.parse(path)
    # Must return a List or Dict
    return [el.text for el in tree.findall('.//item')]

# Pass the function directly into the BaseStep
step = BaseStep(
    id="load-xml",
    action="readData",
    value="data.xml",
    data_type="custom",
    callback=my_custom_reader,
    key="items"
)
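Because a reader callback is an ordinary Python function, it can be checked on its own before being wired into a step. The sketch below runs the same reader against a small sample file. The XML content and the temporary file are made up for illustration, and None is passed for step only because this particular reader never uses it.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def my_custom_reader(path, step):
    """Return the text of every <item> element in the XML file at path."""
    tree = ET.parse(path)
    return [el.text for el in tree.findall(".//item")]


# Made-up sample document with two <item> elements.
sample = "<feed><item>alpha</item><item>beta</item></feed>"

with tempfile.TemporaryDirectory() as tmp:
    xml_path = Path(tmp) / "data.xml"
    xml_path.write_text(sample, encoding="utf-8")
    # step is unused by this reader, so None stands in for the BaseStep here.
    items = my_custom_reader(str(xml_path), None)

print(items)  # ['alpha', 'beta']
```

Testing the callback in isolation like this makes it easier to tell a parsing problem apart from a problem in the step flow.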