It's typical to need to publish incremental data to inmydata. For example, if you have a dataset with three years of sales data and you want to update it every hour, you wouldn't want to republish the entire dataset each time. Instead, you would update only the data that is new or has changed since the last refresh.
To enable this, the publisher app introduces the concept of index columns. These can be specified in the final tab of the data definition screen and are used by the platform to determine which existing data should be removed before new data is added.
When new data is uploaded to the cloud from the publisher app and the "Add to the existing data when new data is published" option is selected, the following steps are performed (sketched in code after this list):
1. The platform generates a set of unique indexes by identifying each distinct combination of values from the columns specified as index columns across all the new data rows.
2. It then removes any rows in the existing cloud data where the combination of values in the index columns matches any of the unique indexes created in step 1.
3. Finally, the platform appends the new data to the existing data in the cloud.
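To make the behaviour concrete, here is a minimal pandas sketch of the same delete-then-append logic. The function name and the assumption that the data sits in DataFrames are illustrative; this is not the inmydata implementation, which runs entirely on the platform.

```python
import pandas as pd

def merge_incremental(existing: pd.DataFrame,
                      new: pd.DataFrame,
                      index_columns: list[str]) -> pd.DataFrame:
    """Sketch of the incremental merge described above, not the real implementation."""
    # Step 1: every distinct combination of index-column values in the upload.
    unique_indexes = new[index_columns].drop_duplicates()

    # Step 2: drop existing rows whose index-column values match any of those
    # combinations (a left anti-join on the index columns).
    matches = existing.merge(unique_indexes, on=index_columns,
                             how="left", indicator=True)
    keep = matches["_merge"].eq("left_only").to_numpy()

    # Step 3: append the new data to whatever survived step 2.
    return pd.concat([existing[keep], new], ignore_index=True)
```

For example, with index columns `["date", "store"]`, any (date, store) pair present in the upload is fully replaced, while pairs absent from the upload are left untouched.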
Choosing your index columns
It is crucial to choose your index columns carefully. Inmydata uses data storage optimized for querying, which makes update transactions slower than in traditional relational databases. Each unique index combination in a data upload necessitates a separate update transaction, so a very large number of unique index combinations can significantly slow down publishing. To manage this, we limit the number of unique index combinations per publishing task to 200,000. If this limit is exceeded, the publish process will fail.
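Because an over-limit upload fails outright, it can be worth counting the unique combinations before publishing. The sketch below assumes the upload is a pandas DataFrame; the function name is hypothetical and the constant simply restates the limit above, not any inmydata API.

```python
import pandas as pd

MAX_UNIQUE_INDEXES = 200_000  # platform limit per publishing task

def check_index_cardinality(upload: pd.DataFrame,
                            index_columns: list[str]) -> int:
    # Count the distinct combinations of index-column values in the upload.
    unique_count = len(upload[index_columns].drop_duplicates())
    if unique_count > MAX_UNIQUE_INDEXES:
        raise ValueError(
            f"{unique_count:,} unique index combinations exceeds the "
            f"{MAX_UNIQUE_INDEXES:,} limit; this publish would fail."
        )
    return unique_count
```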
You should aim for a relatively small number of unique index combinations. For instance, in the screenshot above there is a single index column, date. The data upload might refresh the last week's data each time it runs, replacing the last week in the published dataset. In this scenario, you must ensure that the query never returns partial data for any given day, because the existing data for the entire day will be removed before the new data is loaded; one way to guard against this is sketched below.
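This is a minimal sketch of that guard, assuming the source data is a pandas DataFrame with a date column matching the index column above: it cuts the extract off at the start of the current day, so the still-accumulating day is never uploaded and a complete day of existing data is never replaced by a partial one.

```python
from datetime import date, timedelta
import pandas as pd

def last_complete_week(sales: pd.DataFrame) -> pd.DataFrame:
    today = date.today()               # still accumulating, so excluded
    start = today - timedelta(days=7)  # the seven most recent complete days
    dates = pd.to_datetime(sales["date"]).dt.date
    # Keep rows from `start` up to, but not including, today, so every day
    # in the upload is complete before its existing data is replaced.
    return sales[(dates >= start) & (dates < today)]
```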