This section covers some FAQs related to mapping and ingestion:
Zeotap typically supports the following types of data:
  • Events – Transactional data about user actions, recorded against user identifiers. Apart from attributes that describe the action (such as event name, timestamp and the object of the event), events can also carry the following types of contextual information:
    • Page – Additional information about the page where the event occurred, such as its name and referrer.
    • Device – Information about the device on which the event occurred, such as browser type and device model.
    • Campaign – If the event being tracked relates to a campaign delivery (like viewing an ad or clicking play on a video), the contextual attributes can include campaign name, creative name and other such campaign details.
  • Users – User data includes identifiers and profiles.
    • Identifiers – These are used for unifying users and populating segments with IDs for reach.
    • Profile – These are profile attributes of a user, such as gender, customer category, and so on.
More details on this can be found in the taxonomy section.
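As an illustration of the two data types above, an event and a user record might look like the following sketch. All field names here are illustrative, not Zeotap's actual taxonomy:

```python
# Hypothetical example payloads for the two supported data types.
# Field names are illustrative only; refer to the taxonomy section
# for the actual catalogue fields.

event = {
    "event_name": "add_to_cart",                # what the user did
    "timestamp": "2024-06-01T10:15:00Z",        # when it happened
    "object": "sku_12345",                      # the object of the event
    # Contextual information:
    "page": {"name": "product_detail", "referrer": "https://example.com/search"},
    "device": {"browser": "Chrome", "model": "Pixel 8"},
    "campaign": {"campaign_name": "summer_sale", "creative_name": "banner_a"},
}

user = {
    # Identifiers unify users and populate segments with IDs for reach.
    "identifiers": {"email_sha256": "ab12...", "maid": "f47a..."},
    # Profile attributes describe the user.
    "profile": {"gender": "female", "customer_category": "premium"},
}
```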
A source typically has the following three statuses:
  • Created
  • Integrated
  • Mapped
When a created source starts receiving data, it moves to the integrated state. This triggers the preview and the schema services. Once schema detection is complete, the Map to catalogue button appears. Clicking it takes you to the mapping screen, with the input columns pre-loaded on the left and the catalogue fields on the right to choose from as required. Click the refresh schema button if you do not see an expected column.
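The status flow above can be sketched as a small state machine. The status names come from this page; the transition triggers are paraphrased assumptions, not actual API events:

```python
# Toy state machine for the source status lifecycle described above.
# The triggers ("data_received", "mapping_saved") are assumed names
# summarising the transitions, not real event identifiers.
TRANSITIONS = {
    ("Created", "data_received"): "Integrated",   # source starts receiving data
    ("Integrated", "mapping_saved"): "Mapped",    # mapping saved via Map to catalogue
}

def next_status(status, trigger):
    # Unknown (status, trigger) pairs leave the status unchanged.
    return TRANSITIONS.get((status, trigger), status)

status = "Created"
status = next_status(status, "data_received")   # Integrated
status = next_status(status, "mapping_saved")   # Mapped
```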
The preview is generated from the raw data sent to the bucket, not the processed data. Use the preview to verify that the data looks as expected before proceeding with ingestion. The preview is regenerated on every new file drop, from 100 random records across the last five files dropped. This ensures visibility into both older and newer data.
Our schema detection service recalculates the schema on every file drop and adds it to the input columns side of your mapping section. If you do not see an expected column, click the refresh schema button.
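The preview sampling described above can be modelled roughly as follows. This is a simplified sketch, not the actual service; the function and parameter names are our own:

```python
import random

def build_preview(files, sample_size=100, last_n=5, seed=None):
    """Sketch of the preview logic described above: pool the records
    from the last `last_n` files dropped and take a random sample,
    so the preview shows both older and newer data in that window."""
    rng = random.Random(seed)
    pool = [rec for f in files[-last_n:] for rec in f]
    k = min(sample_size, len(pool))   # fewer records than sample_size? take them all
    return rng.sample(pool, k)

# Toy usage: seven "files" of 40 records each; only the last five
# (file2..file6) contribute to the 100-record preview.
files = [[f"file{i}_rec{j}" for j in range(40)] for i in range(7)]
preview = build_preview(files, seed=42)
```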
The following are the mapping relationships that are supported:
  • One-to-one – This is supported by default. You can map any source field to a target column of your choice. Once chosen, the same target catalogue field does not reappear for the other source columns.
  • One-to-many – If the same source column must be mapped to multiple catalogue fields, then that can be done by selecting all such target fields.
  • The first catalogue field selected as the target determines which other fields appear in the drop-down menu, based on its data format. For example, if you select purchase time as the target field, then all the other fields to which you can additionally map the input column are filtered to show only timestamp-formatted fields. In short, the available targets depend on the data format of the first target.
  • Currently, the enrichers you add apply together to all the target columns. We are working towards adding an enricher per target column so that transformations can be decided at that level.
  • Many-to-many – This is currently not supported. We plan to add support for it by providing additional enrichers that supply ranking or combination logic.
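The one-to-one and one-to-many rules above can be sketched as a small validation routine. The catalogue fields, formats and function names here are hypothetical, used only to illustrate the rules:

```python
# Hypothetical catalogue: field name -> data format.
CATALOGUE = {
    "purchase_time": "timestamp",
    "signup_time": "timestamp",
    "first_name": "string",
}

def validate_mapping(mapping, catalogue=CATALOGUE):
    """Sketch of the mapping rules described above:
    - one-to-one: each target catalogue field may be used by only one source column;
    - one-to-many: one source column may map to several targets, but every
      additional target must share the data format of the first target chosen."""
    used = set()
    for source, targets in mapping.items():
        first_format = catalogue[targets[0]]
        for t in targets:
            if t in used:
                raise ValueError(f"{t} is already mapped (one-to-one rule)")
            if catalogue[t] != first_format:
                raise ValueError(f"{t} does not match the format {first_format}")
            used.add(t)
    return True

# One-to-many: a single source column mapped to two timestamp fields.
validate_mapping({"txn_ts": ["purchase_time", "signup_time"]})
```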

For Streaming Sources (Web JS, Pixel, HTTP API)

The following points cover some important information about the ingestion pipeline frequency of streaming sources:
  • Smart pixel takes the write key from Collect and uses it to determine the pub/sub queue to push data to. Incoming traffic is automatically pushed to this queue according to the chosen Region.
  • From pub/sub, the provisioned Dataflow job consumes the data every four hours, batches it into a file and writes it to the corresponding GCS bucket for the Region(s).
  • All streaming data is sent to a pub/sub queue.
  • These are batched and dropped to the cloud storage buckets every six hours.
  • Since streaming data accumulates multiple such file groups throughout the day, it is set up to ingest once per day, on a fixed schedule.
  • When the mapping is saved for the first time, it triggers immediate ingestion of all data collected till that point and schedules the pipeline to run once per day at a random time.
  • Subsequently, it runs once every 24 hours and ingests files from yesterday and today.
There can be some duplication of records when the pipeline is first set up and on updates. All flows, from Dataflow through the ingestion pipelines, use UTC.

Examples
  • 06/01 - Files: f1, f2, <= Ingestion created at t1, will ingest f1,f2 ; New files dropped: f3, f4
  • 06/02 - Files: f5, f6, <= Ingestion scheduled to ingest yesterday’s data, will ingest f3, f4 (+ f1, f2 again) ; New files dropped: f7, f8
  • 06/03 - Files: f9, f10, <= Ingestion scheduled to ingest yesterday’s data, will ingest f5, f6, f7, f8; New files dropped: f11, f12
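The worked example above can be modelled as a toy function. This is a deliberate simplification of the real schedule, shown only to illustrate why f1 and f2 are ingested twice:

```python
from datetime import date, timedelta

def scheduled_ingest(files_by_day, run_day):
    """Toy model of the daily run in the example above: each scheduled
    run re-reads the files dated the previous day, so files already
    picked up by the initial (mapping-save) run are ingested again."""
    yesterday = run_day - timedelta(days=1)
    return list(files_by_day.get(yesterday, []))

files_by_day = {
    date(2024, 6, 1): ["f1", "f2", "f3", "f4"],  # f1, f2 before the first run; f3, f4 after
    date(2024, 6, 2): ["f5", "f6", "f7", "f8"],
}

day2 = scheduled_ingest(files_by_day, date(2024, 6, 2))  # f1, f2 duplicated
day3 = scheduled_ingest(files_by_day, date(2024, 6, 3))
```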

For Batch Sources (Flat File)

All data files can be uploaded directly to the cloud storage bucket or scheduled from the client side. Since the delivery cadence may vary depending on the origin system and the type of data being sent, ingestion is triggered by a file drop. The following points cover some important information about the ingestion pipeline frequency of batch sources:
  • When the mapping is saved for the first time, it triggers immediate ingestion of all data collected till that point. However, the pipeline is not scheduled as file updates may happen at different frequencies.
  • Each file drop triggers the deployment of the ingestion pipeline based on the saved mapping.
  • Multiple file drops trigger their corresponding ingestions individually.
Currently, the following two enhancements are under testing for optimising this flow:
  • Running an already-deployed ingestion pipeline on each file drop, rather than redeploying it every time.
  • Sequencing the runs when multiple files are dropped in order to ensure successful completion.
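The sequencing enhancement mentioned above could look roughly like the following sketch. The class and method names are our own, and the real implementation may differ:

```python
from collections import deque

class SequencedIngestion:
    """Sketch of the sequencing enhancement described above: instead of
    deploying one pipeline per file drop in parallel, drops are queued
    and run one at a time so each run completes before the next starts."""

    def __init__(self, run_pipeline):
        self.pending = deque()
        self.run_pipeline = run_pipeline  # callable that ingests one file

    def on_file_drop(self, path):
        # Queue the drop instead of triggering an ingestion immediately.
        self.pending.append(path)

    def drain(self):
        # Run queued ingestions strictly in arrival order.
        completed = []
        while self.pending:
            path = self.pending.popleft()
            self.run_pipeline(path)  # blocks until this run completes
            completed.append(path)
        return completed

# Toy usage: record which files the pipeline processed, in order.
ran = []
seq = SequencedIngestion(run_pipeline=ran.append)
seq.on_file_drop("drop1.csv")
seq.on_file_drop("drop2.csv")
done = seq.drain()
```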
Streaming ingestions can be paused and resumed as required, especially for cases where the website or app is undergoing structural changes or a campaign source is no longer relevant. Since batch sources trigger ingestion when a file is dropped, pausing is not applicable to batch sources.

Dataflows Pause

You can pause a streaming source from the Collect UI whenever it is not going to send data for some period of time.

Ingestion Pipeline Pause

This is currently not available on the UI. Pausing the Dataflow automatically takes care of the ingestion part.
Last modified on February 26, 2026