Best way to label sensor-generated data for machine learning?

Hi everyone, I’m building a dataset from live IoT sensor streams (e.g., motion and environmental data) and want to use Label Studio to annotate the events for downstream model training. For context, I’ve been following this ESP32 motion-detection project: https://www.theengineeringprojects.com/2022/03/iot-based-motion-detection-with-email-alert-using-esp32.html. The output is timestamped triggers that I’d like to label with categories like false positive, true movement, etc.

I’ve also seen Arduino forum threads and Raspberry Pi community projects where people log sensor data to CSV or cloud dashboards, as well as some IoT discussions about MQTT and REST ingestion patterns. What workflows or configurations have you found effective in Label Studio for handling streaming or large time-series data annotation, especially when the labels are event-based rather than image/text?

Recommended Label Studio workflows for streaming / large event-based time-series

1) Treat “streaming” as micro-batch tasks (most reliable)

Label Studio is built around tasks (records) rather than a continuously updating stream. The most effective pattern is:

  1. Ingest live IoT data (MQTT/REST/etc.) into your own storage (DB/object store).
  2. Periodically cut it into labelable windows (for example 10–60 seconds, or “N samples around each trigger”).
  3. Create one Label Studio task per window, and point the task to the data via valueType="url" (CSV) or valueType="json" (embedded/hosted JSON).
  4. Annotators label regions (intervals) or “events” inside the window using TimeSeriesLabels.

This matches how the TimeSeries labeling UI is intended to work (annotate spans on a timeline), and it scales better than trying to label an infinite stream in one task.

Tip: use a “trigger-centered” window (e.g., t_trigger - 2s to t_trigger + 5s) so annotators mostly see the part that matters.
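The windowing steps above can be sketched in Python. This is a sketch, not a turnkey implementation: the column names (timestamp, pir, accel_mag), window sizes, and the hosting URL are illustrative assumptions to adapt to your pipeline.

```python
# Sketch: cut trigger-centered windows from a sample buffer and emit
# one Label Studio task per window.
import csv
import io
import json

PRE_S, POST_S = 2.0, 5.0  # window: t_trigger - 2s .. t_trigger + 5s

def cut_window(samples, t_trigger):
    """Keep only the samples inside the trigger-centered window."""
    return [s for s in samples
            if t_trigger - PRE_S <= s["timestamp"] <= t_trigger + POST_S]

def window_to_csv(rows, columns=("timestamp", "pir", "accel_mag")):
    """Serialize one window to the wide CSV the TimeSeries tag expects."""
    buf = io.StringIO()
    w = csv.DictWriter(buf, fieldnames=columns)
    w.writeheader()
    w.writerows(rows)
    return buf.getvalue()

def make_task(csv_url):
    # The "csv" key must match the $csv variable in the labeling config.
    return {"data": {"csv": csv_url}}

# Synthetic demo: 1 Hz samples, trigger at t=10
samples = [{"timestamp": float(t), "pir": 0, "accel_mag": 0.1} for t in range(20)]
win = cut_window(samples, t_trigger=10.0)
task = make_task("https://example.com/windows/event_0001.csv")
print(len(win), json.dumps(task))
```

After uploading each window CSV to your object store, import the corresponding task JSON; one task per window keeps annotation sessions short and cache-friendly.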

2) Use TimeSeriesLabels for event/interval annotation (true movement, false positive, etc.)

For event-based labels, you typically want to mark a short interval around the event (even if the raw trigger is instantaneous). Label Studio’s TimeSeries region labels are a good fit.

Minimal example (CSV behind a URL):

<View>
  <TimeSeriesLabels name="event" toName="ts">
    <Label value="true_movement" background="#4CAF50"/>
    <Label value="false_positive" background="#F44336"/>
    <Label value="unknown" background="#9E9E9E"/>
  </TimeSeriesLabels>

  <TimeSeries name="ts"
              valueType="url"
              value="$csv"
              sep=","
              timeColumn="timestamp">
    <Channel column="pir" legend="PIR trigger"/>
    <Channel column="accel_mag" legend="Accel magnitude"/>
    <Channel column="temp_c" legend="Temperature"/>
    <Channel column="humidity" legend="Humidity"/>
  </TimeSeries>
</View>

If your timestamps are Unix seconds, set timeFormat="%s" (this is a common gotcha when importing time series). See the Time Series template docs and examples:
https://docs.humansignal.com/templates/time_series.html
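If you'd rather not depend on timeFormat parsing at all, one option is to pre-convert epoch timestamps to ISO 8601 when writing the window CSVs. A minimal stdlib-only sketch:

```python
from datetime import datetime, timezone

def epoch_to_iso(ts: float) -> str:
    """Convert Unix seconds to an ISO 8601 UTC string before writing the CSV."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()

print(epoch_to_iso(1648000000))
```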

3) Keep overview readable with overviewChannels (but note current limitations/bug)

overviewChannels is intended to control which channels appear in the bottom overview/brush panel (not the main plot) per the tag docs:
https://docs.humansignal.com/tags/timeseries

However, there is an open report that overviewChannels can be ignored (showing all channels) in some versions/repros:
https://github.com/HumanSignal/label-studio/issues/8176

Practical guidance:

  • Always pass exact channel column names (e.g., overviewChannels="pir,accel_mag", matching the Channel column values, not the display legends).
  • If it still shows everything, assume you hit the known issue and focus on other ways to reduce clutter (next section).

4) Reduce clutter for many sensors: use MultiChannel and/or engineer “summary channels”

If you have many axes (IMU x/y/z, multiple sensors), showing everything can overwhelm annotators. Two common approaches: group related axes into a single plot with the MultiChannel tag, and/or precompute “summary channels” upstream (e.g., one accelerometer-magnitude channel instead of raw x/y/z) so annotators only see the few signals that matter for the labeling decision.
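As a concrete example of a summary channel, collapsing raw IMU axes into a single magnitude (a sketch; the accel_mag column name matches the config example above and is otherwise an assumption):

```python
import math

def accel_magnitude(x: float, y: float, z: float) -> float:
    # One "summary channel" replacing three raw IMU axes
    return math.sqrt(x * x + y * y + z * z)

# Write this as an accel_mag column alongside timestamp in the window CSV
print(accel_magnitude(3.0, 4.0, 0.0))  # → 5.0
```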

5) Important for “event-based”: decide instant vs interval and standardize it

Label Studio will store TimeSeries labels as regions with start/end in the exported annotation JSON. Even if your device produces a single trigger timestamp, it’s usually better for consistency to label a small interval (e.g., trigger_time ± 250ms) so reviewers can see context.

Also note for pre-annotations/predictions: server-side validation can be strict about start/end matching the TimeSeries axis (especially when you use timeColumn). If your axis is discrete, fractional times that don’t exist in the underlying time array can be rejected. This behavior has come up in support discussions and aligns with stricter prediction validation, so snap predicted start/end values to timestamps that actually exist in your data.
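A sketch of that convention: build a trigger-centered interval and snap its endpoints to timestamps that exist in the time axis. The region shape below follows Label Studio's exported TimeSeries result format; the event/ts tag names match the config example above, and the ±250 ms half-width is an assumption.

```python
import bisect

HALF_WIDTH_S = 0.250  # trigger_time ± 250 ms

def snap(t, axis):
    """Snap t to the nearest timestamp that exists in the (sorted) time axis."""
    i = bisect.bisect_left(axis, t)
    candidates = axis[max(0, i - 1):i + 1]
    return min(candidates, key=lambda a: abs(a - t))

def trigger_region(t_trigger, axis, label="true_movement"):
    start = snap(t_trigger - HALF_WIDTH_S, axis)
    end = snap(t_trigger + HALF_WIDTH_S, axis)
    # from_name/to_name must match the tag names in the labeling config
    return {
        "from_name": "event",
        "to_name": "ts",
        "type": "timeserieslabels",
        "value": {"start": start, "end": end,
                  "timeserieslabels": [label]},
    }

axis = [round(0.1 * i, 1) for i in range(100)]  # 10 Hz samples, 0.0..9.9
print(trigger_region(5.03, axis))
```

Snapping keeps reviewer-visible intervals consistent and avoids rejected predictions when the axis is discrete.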

6) Storage/ingestion patterns: “URL-based tasks” + optional Redis workflow

If your pipeline already produces CSV files (or JSON snapshots) per window/event, host them somewhere reachable and import tasks that reference those URLs.
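A minimal sketch of importing URL-based tasks through Label Studio's bulk-import endpoint. LS_URL, the token, the project id, and the hosted CSV URLs are placeholders; the "csv" data key must match the $csv variable in the labeling config.

```python
import json
import urllib.request

LS_URL = "http://localhost:8080"      # your Label Studio instance
API_TOKEN = "your-token-here"         # Account & Settings -> Access Token

def build_tasks(csv_urls):
    # One task per hosted window CSV
    return [{"data": {"csv": u}} for u in csv_urls]

def import_tasks(project_id, tasks):
    """POST tasks to Label Studio's bulk import API."""
    req = urllib.request.Request(
        f"{LS_URL}/api/projects/{project_id}/import",
        data=json.dumps(tasks).encode(),
        headers={"Authorization": f"Token {API_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

tasks = build_tasks(["https://example.com/w/0001.csv",
                     "https://example.com/w/0002.csv"])
print(len(tasks))
```

Running this from a cron job or the tail end of your ingestion pipeline gives you the micro-batch workflow from section 1 without any manual imports.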

There’s also a user-reported pattern of using Redis as a source to help with syncing URLs and imports; a discussion around not having direct PostgreSQL integration and using automated conversion + storage appears here:
https://github.com/HumanSignal/label-studio/issues/7263

7) What’s not supported today: true “single stream, label all sensor_ids in one column”

If your CSV contains multiple logical “groups” in a single column (e.g., sensor_id and values stacked) and you want separate plots per group in one task, that specific format has been called out as not supported in a support thread:
https://label-studio.slack.com/archives/CQ8LYPPJS/p1662469110552389

The usual workaround is to reshape data into separate channels/columns (wide format) or separate tasks.
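The reshape can be done upstream in a few lines of Python (a sketch; the timestamp/sensor_id/value column names are assumptions about your stacked CSV):

```python
# Sketch: reshape "stacked" long-format rows (timestamp, sensor_id, value)
# into wide format with one column per sensor, so TimeSeries can plot each
# sensor as its own Channel.
from collections import defaultdict

def long_to_wide(rows):
    by_ts = defaultdict(dict)
    sensors = set()
    for r in rows:
        by_ts[r["timestamp"]][r["sensor_id"]] = r["value"]
        sensors.add(r["sensor_id"])
    header = ["timestamp"] + sorted(sensors)
    wide = []
    for ts in sorted(by_ts):
        row = {"timestamp": ts,
               **{s: by_ts[ts].get(s, "") for s in sorted(sensors)}}
        wide.append(row)
    return header, wide

rows = [
    {"timestamp": 0, "sensor_id": "pir", "value": 1},
    {"timestamp": 0, "sensor_id": "temp_c", "value": 21.5},
    {"timestamp": 1, "sensor_id": "pir", "value": 0},  # temp_c missing here
]
header, wide = long_to_wide(rows)
print(header)  # → ['timestamp', 'pir', 'temp_c']
```

Missing readings become empty cells; if sensors report at very different rates, per-sensor tasks may be cleaner than a sparse wide table.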