Harvesting
The tool downloads source data from all feed endpoints and stores it in the `raw_data` table.
When re-run, it only downloads new data: it records how far it got in the `raw_next_url` column of the `publisher_feed` table.
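For example, you could inspect the resume state for each feed directly. This is only a sketch: the `raw_next_url` column is described above, but the `id` column on `publisher_feed` is an assumption.

```sql
-- Inspect harvest progress per feed (id column is assumed here).
-- raw_next_url records where the downloader will resume from.
SELECT id, raw_next_url
FROM publisher_feed;
```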
Deletes are soft deletes, marked by the `data_deleted` column in the `raw_data` table.
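Because deletes are soft, queries over the raw data should usually exclude deleted rows. A minimal sketch, assuming `data_deleted` is NULL for live rows:

```sql
-- Fetch only live (not soft-deleted) raw records.
SELECT *
FROM raw_data
WHERE data_deleted IS NULL;
```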
To run this:
$ node ./src/bin/download-raw.js
It can be stopped at any time without leaving the database in a bad state; only records being processed at that moment are lost.
When restarted, it picks up where it left off.
Errors
Any errors encountered during this stage are stored in the `download_raw_errors` table:

- `url` - where the error occurred
- `error` - what the error was
- `publisher_feed_id` - the publisher feed we were fetching when the error occurred
- `error_at` - the date and time the error occurred
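To review recent failures, you could query the errors table directly. A sketch using the columns listed above:

```sql
-- Most recent download errors, newest first.
SELECT error_at, publisher_feed_id, url, error
FROM download_raw_errors
ORDER BY error_at DESC
LIMIT 20;
```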
Clear out work already done (Database storage)
To do this, run the SQL:
To only download a limited set of raw data
You may want to do this to avoid processing too much data, or to debug a problem with a specific publisher's data by fetching that publisher's data only.
Stop the process early
The process can be forcibly stopped at any point and the database will not be left in a bad state. It will contain most of the work done up to the point you stopped it.
Edit the SQL query that determines data to download
If you only want to get data for one or some publisher feeds, you can edit the SQL statement in `src/lib/download-raw.js`.
In the `download_raw_all_publisher_feeds` function there is a `SELECT` statement; simply add a `WHERE` clause with an appropriate filter.
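As an illustration, a filtered version of such a statement might look like the following. The column name and value are hypothetical; check the actual `SELECT` in `src/lib/download-raw.js` for the real shape of the query.

```sql
-- Hypothetical: restrict harvesting to a single publisher feed.
-- Replace the id with the feed you want to debug.
SELECT *
FROM publisher_feed
WHERE id = 42;
```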