Introduction
Hi there and thanks for stopping by to check out Purrbuddy! My
name's Isaac and I'm a software engineer with a strong interest in
design and innovation. This project was a way
for me to dive deeper into machine learning through a hands-on
experiment, one that also happens to involve Archie the cat as an
unexpectedly cooperative (sometimes) research partner.
Archie is my partner’s cat, though he goes by many names: Archibald,
Baldy, Archibaldy Bald... you get the idea. The core idea behind
this project was to build a smart cat collar that uses machine
learning to classify a cat’s activity in real time. With that data,
we could log his behavior throughout the day and generate insights
like how long he spends eating, drinking, running, sleeping, or just
loafing around. Over time, this data could even be used to spot
trends or irregularities, potentially supporting early health or
behavior intervention.
The collar is equipped with a microcontroller that has an inbuilt
accelerometer and gyroscope, and it streams data from these
sensors to my computer while Archie goes about his day. Meanwhile, I
record video footage of him which allows me to manually label each
timeframe with the corresponding activity (e.g.,
“2025-10-10T10:30:21.356 - 2025-10-10T10:30:24.555 — EATING”). I
then use a script to align these labels with the sensor data,
producing a labeled dataset. From there, I process the data,
engineer features, and train a machine learning model that can
predict Archie’s activity in real time.
Although I didn’t follow it exactly, I’d like to acknowledge the
authors of this paper: “The Use of Triaxial Accelerometers and
Machine Learning Algorithms for Behavioural Identification in
Domestic Cats (Felis catus): A Validation Study” (link
here). Their methodology played a key role in shaping my approach, and
their work provided a solid foundation that helped guide my
direction with confidence.
Important
For context, it's worth mentioning that this project is ongoing
and has gone through two main iterations so far. The first was my
initial attempt, which unfortunately led to a lot of discarded
data (you can read more about that under
Flaws and Future Considerations – Hardware Consistency). The current phase is a fresh start, building on the lessons
from that first run and restarting the data collection process
with a more refined approach.
Labeling
Establishing Labels
Before training a model, I had to answer an important question:
Which activities are actually useful for someone to know about their
cat?
After reading various online resources on feline behavior, doing
some ChatGPT brainstorming, and experimenting with predictive
accuracy during the first iteration of this project, I have
currently narrowed it down to the following activity labels:
- grooming
- itching (scratching self)
- ground_sniffing
- kneading
- littering
- shaking
- scratching (e.g. scratching a post)
- running
- walking
- trotting
- bum_pats
- eating_drinking
- deceased
- jump_up
- jump_down
- still
- unsure
There are certainly more activities I could have included, but these
represent the most common and distinguishable activities in Archie’s
typical environment. The goal was to strike a balance between
ambition and feasibility.
Grouping Rare or Unclear Behaviors
These labels are admittedly optimistic as I’m still unsure how much
time I’ll ultimately spend collecting and labeling data to achieve a
wide range of activities with a solid class balance. Throughout both
iterations of the project, I’ve been using a common strategy:
grouping rare or difficult-to-label behaviors under broader labels
until there’s enough data to split them out properly. These broader
labels are:
- active_light – for slower, more predictable movements.
- active_chaotic – for fast, unpredictable bursts of activity.
While not ideal, these broader labels were necessary both to keep
the project moving and to preserve my sanity. Cats do a lot of
strange, hard-to-classify things. So when I can't confidently assign
a specific label, or when there just isn’t enough data for an
activity to stand on its own, it goes into one of these broader
buckets.
That said, I’ve noticed that some distinct activities (like jumping)
still get classified quite accurately, even when there’s not a lot
of data for them. So I’ve been keeping those as their usual separate
labels rather than lumping them into broader categories. Of course,
this might change as I collect more data, but it’s what I’ve found
to work so far.
Label Justifications & Additional Commentary
Some labels in the set deserve a bit of extra explanation, not from
a medical perspective (I’m not a vet and don’t want to spread
misinformation), but from a practical point of view:
- itching – This is often a sign that the collar has shifted up the neck and may need to be repositioned. Including this can help detect orientation issues in addition to behavior.
- shaking – Similar to itching, shaking is another good indicator that the collar has moved or is bothering the cat.
- ground_sniffing – I included this mainly to help the model distinguish between sniffing and eating. I considered permanently grouping it under active_light, but since the model could predict it with decent accuracy, I kept it as its own label.
- bum_pats – I was curious to see how well the model could detect human interaction. Tracking this behavior could be useful down the line, or just a fun stat that could encourage positive and more frequent human interaction.
- eating_drinking – Eating and drinking were combined into one label since, from motion data alone, they’re nearly impossible to tell apart. With a large enough dataset in the future, I might revisit this and split them.
- deceased – No, Archie is fine. I simulated this by placing the collar on a desk, completely still, in various orientations. It was easy to collect and could serve as a way to know the collar has become detached.
- jump_up / jump_down – These are common cat behaviors that I wanted to track, but I also included them because I thought it could be fun to later trigger a Super Mario-style jump sound every time he leaps.
- unsure – This label is used when Archie is out of the camera’s view or when the collar shifts too far around his neck to get reliable data. Any data labeled “unsure” is excluded from training.
Lastly, there’s a “resting” output. This is not a real label, but a
post-processing state. When the model detects Archie has been still
for an extended period, I translate “still” into “resting”.
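To make that concrete, here is a minimal sketch of the post-processing step in Python. The threshold and names are illustrative assumptions, not the exact values used in the project.

```python
# Minimal sketch of the "still" -> "resting" post-processing step.
# The threshold and names are illustrative, not the project's exact values.
from datetime import timedelta

RESTING_THRESHOLD = timedelta(minutes=5)  # assumed duration of stillness before "resting"

def apply_resting_state(predictions):
    """predictions: time-ordered list of (timestamp, label) tuples."""
    output = []
    still_since = None
    for timestamp, label in predictions:
        if label == "still":
            if still_since is None:
                still_since = timestamp
            # Once the run of "still" exceeds the threshold, report "resting".
            if timestamp - still_since >= RESTING_THRESHOLD:
                label = "resting"
        else:
            still_since = None
        output.append((timestamp, label))
    return output
```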
Labeling with Overlapping Sliding Window Technique
In the first iteration of this project, I labeled the activity data
in a very strict, idealised way. For example, if Archie was eating,
I would only label the parts where he was clearly eating and any
ambiguous or borderline moments were discarded. My thinking at the
time was to train the model only on clean, unmistakable examples of
each activity, and let it "guess" the rest. However, I later
realised I was limiting the model’s potential, especially when it
came to transitions between activities.
An overlapping sliding window, in this context, is a method where you
break a continuous stream of data into overlapping windows (e.g.,
1-second windows). You then extract statistical features (like min,
max, skew, etc.) from each window and use those as inputs for your
model. This allows you to capture the shape and transition of
movement over time, rather than just relying on isolated data
points.
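As a rough illustration (not the project's exact script), here is how overlapping windows can be cut from a timestamped sensor stream with pandas. The 1-second window, 0.25-second step, and column name are assumptions for the example.

```python
# Rough sketch of overlapping window segmentation over a timestamped stream.
# Window/step lengths and the 'timestamp' column name are illustrative.
import pandas as pd

WINDOW = pd.Timedelta(seconds=1.0)   # window length
STEP = pd.Timedelta(seconds=0.25)    # step of 25% of the window -> 75% overlap

def iter_windows(df):
    """Yield successive overlapping slices of a DataFrame sorted by 'timestamp'."""
    start = df["timestamp"].iloc[0]
    end = df["timestamp"].iloc[-1]
    while start + WINDOW <= end:
        window = df[(df["timestamp"] >= start) & (df["timestamp"] < start + WINDOW)]
        if not window.empty:
            yield window
        start += STEP
```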
The problem was that with gaps in my original labeled data, the
sliding window couldn’t be applied consecutively. By excluding
ambiguous moments, I was creating holes in the dataset that made
this technique ineffective. So I changed my approach in the second
iteration and started labeling activities more continuously, even
during transitions. This let me take full advantage of the sliding
window method, giving the model more training data and a better
understanding of activity transitions.
So far, this technique has delivered noticeably better performance
compared to the method I used in the first iteration, even with a
similar amount of data.
Feature Development
Data Collection
The data collection process is fairly straightforward. I start with
some basic calibration, strap the collar onto Archie, and let it
stream accelerometer and gyroscope data to my computer while I film
him. To allow me to align the video footage with the sensor data, I
use a camera app called “Timestamp Camera” that overlays a
millisecond-precision timestamp in the corner of the footage.
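Once the footage has been labeled, aligning it with the sensor data amounts to joining labeled time ranges onto sensor rows by timestamp. Here is a simplified sketch of that idea; the column names and schema are assumptions for illustration, not my actual script.

```python
# Simplified sketch of aligning labelled time ranges with sensor rows.
# Column names ('timestamp', 'start', 'end', 'activity') are assumptions.
import pandas as pd

def label_sensor_data(sensor_df, labels_df):
    """sensor_df: sensor readings with a datetime 'timestamp' column.
    labels_df: one row per labelled interval, e.g.
    2025-10-10T10:30:21.356 .. 2025-10-10T10:30:24.555 -> EATING."""
    sensor_df = sensor_df.copy()
    sensor_df["activity"] = None
    for interval in labels_df.itertuples():
        mask = (sensor_df["timestamp"] >= interval.start) & (sensor_df["timestamp"] <= interval.end)
        sensor_df.loc[mask, "activity"] = interval.activity
    # Unlabelled rows and anything marked "unsure" are excluded from training.
    return sensor_df[sensor_df["activity"].notna() & (sensor_df["activity"] != "unsure")]
```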
Feature Development
While some machine learning models, particularly deep learning
architectures, can learn directly from raw time-series data,
traditional models like decision trees, random forests, or Extra
Trees typically perform better when given structured,
hand-engineered features.
Since I'm using traditional models in this project, I apply feature
development: I break the continuous sensor stream into short time
windows and summarise each window down to a set of features.
This turns raw accelerometer and gyroscope readings into a
structured format the model can easily learn from.
For example, let’s say we collect 100 consecutive readings from the
accelerometer and gyroscope. Rather than feed all 100 rows into the
model, we aggregate that data into a single row of features that
capture the movement patterns during that window. One simple feature
might be the average acceleration on the X-axis. When there is more
than one distinct activity within the window, the window is labeled
with the most predominant activity. There are various ways a label
could be chosen in this situation, but I went with this one because
it was simple to implement and fundamentally makes sense.
Here’s a sample of the kinds of features I extracted, many of which
are commonly used in activity recognition tasks:
- Basic statistics for each axis of acceleration and gyroscope data (X, Y, Z) - Mean, minimum, maximum, standard deviation, skewness, kurtosis.
- Correlation between axes - e.g. correlation between Ax and Ay, Gx and Gz, etc.
- Energy and movement-based features - SMA (Signal Magnitude Area), ODBA / VDBA (Overall/Vectorial Dynamic Body Acceleration) and AVM (Average Vector Magnitude).
- Tilt-based orientation features - Mean tilt in X and Y, Gyroscope tilt, Complementary tilt.
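The sketch below shows how one window of readings could be reduced to a single feature row, covering a few of the features listed above and labeling the window by its predominant activity. The axis column names are assumptions, and only a subset of the features is shown.

```python
# Sketch of reducing one window of readings to a single feature row.
# Axis column names are assumptions; only a subset of features is shown.
from scipy.stats import skew, kurtosis

AXES = ["ax", "ay", "az", "gx", "gy", "gz"]  # accelerometer + gyroscope axes

def extract_features(window):
    """window: DataFrame of consecutive readings with per-axis columns
    and a per-row 'activity' column."""
    features = {}
    for axis in AXES:
        values = window[axis]
        features[f"{axis}_mean"] = values.mean()
        features[f"{axis}_min"] = values.min()
        features[f"{axis}_max"] = values.max()
        features[f"{axis}_std"] = values.std()
        features[f"{axis}_skew"] = skew(values)
        features[f"{axis}_kurtosis"] = kurtosis(values)
    # One example cross-axis correlation and a simple Signal Magnitude Area.
    features["corr_ax_ay"] = window["ax"].corr(window["ay"])
    features["sma"] = window[["ax", "ay", "az"]].abs().sum(axis=1).mean()
    # Window label: the most predominant activity within the window.
    features["activity"] = window["activity"].mode().iloc[0]
    return features
```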
Choosing a Window Size
I experimented with two different window sizes: 1-second and
0.5-second windows. Here’s what I found:
- Shorter windows (0.5s) - Better at capturing quick, high-intensity actions like jumping or shaking. These activities are over in a flash, so smaller windows help the model "see" them more clearly.
- Longer windows (1s) - More effective for detecting slower, continuous activities like eating or walking, where more context improves prediction accuracy and reduces misclassifications.
Handling Variable Sensor Read Rates
The Arduino streams data from the IMU at around 104 Hz; that's
roughly 104 readings per second. But in practice, the number of
readings can vary depending on factors like battery level, hardware
imperfections, distance from the receiver, and wireless
interference.
For example, the data collection script might receive 50 readings in
one second and 140 readings in the next. My thinking was that this
variability could corrupt the consistency of feature windows if you
split the data purely by row count, or by time windows without
accounting for how many readings each window contains.
To solve this, I use a time-based sliding window technique with
group size consideration. That means each window includes all
readings that fall within a specific time span, regardless of how
many rows that turns out to be. Within each window, I then calculate
and store the group size (the number of rows in the window) as an
additional feature. Not only can this be used as a feature, it also
acts as a quality check. If the group size falls outside a safe threshold
(e.g., too few or too many readings), I discard that group from
training.
When running the model in real time, if a group size is out of
bounds, I flag the prediction as “possibly inaccurate”. This way, I
preserve the integrity of my training data and help ensure the model
isn’t learning from windows with limited or noisy input.
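Here is a minimal sketch of that group-size check; the thresholds are made-up values for a roughly 104 Hz stream and 1-second windows, not the project's actual bounds.

```python
# Minimal sketch of the group-size quality check (thresholds are illustrative).
MIN_GROUP_SIZE = 70    # assumed lower bound for ~104 Hz over a 1 s window
MAX_GROUP_SIZE = 140   # assumed upper bound

def group_size_ok(window):
    """Return (group_size, ok) for one time window of readings."""
    group_size = len(window)
    return group_size, MIN_GROUP_SIZE <= group_size <= MAX_GROUP_SIZE

# Training: windows where ok is False are discarded.
# Real time: the prediction is kept but flagged as "possibly inaccurate".
```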
Model Development
Undersampling & Oversampling
To train a well-balanced model, it’s important to avoid having too
much of one activity and too little of another; this imbalance can
bias the model toward predicting the more common classes. Ideally,
each class should have roughly the same amount of training data. To
get closer to this goal, I use a combination of undersampling and
oversampling.
- Undersampling - This involves reducing the amount of data from overrepresented classes. In my case, I used a simple random undersampling approach which randomly removes entries until the class size is closer to the target. There are more sophisticated techniques available, but this worked well enough for my dataset.
- Oversampling - This is where we increase the number of samples for underrepresented classes. I used SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic examples rather than simply duplicating existing ones. I did try basic duplication early on, but it often made the model overfit and behave unpredictably in practice.
As for deciding how much data each class should have, it depends on
the total amount available, but a rough rule of thumb I follow is to
calculate the mean number of samples per class and aim for that as a
baseline target across all classes.
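As a sketch of that balancing step (not necessarily my exact script), the imbalanced-learn library provides both a random undersampler and SMOTE, and both accept per-class target counts, so the mean-per-class rule of thumb can be applied roughly like this:

```python
# Sketch of balancing classes toward the mean class count with imbalanced-learn.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

def balance_classes(X, y):
    counts = Counter(y)
    target = int(sum(counts.values()) / len(counts))  # mean samples per class

    # Step 1: randomly undersample classes that sit above the target.
    under_targets = {label: min(count, target) for label, count in counts.items()}
    X, y = RandomUnderSampler(sampling_strategy=under_targets).fit_resample(X, y)

    # Step 2: synthesise new samples (SMOTE) for classes below the target.
    # Note: SMOTE needs a handful of real samples per class to interpolate from.
    counts = Counter(y)
    over_targets = {label: max(count, target) for label, count in counts.items()}
    X, y = SMOTE(sampling_strategy=over_targets).fit_resample(X, y)
    return X, y
```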
Building the Model
To build the model, I used PyCaret, a beginner-friendly Python
machine learning library that made the process really
straightforward. It basically let me say, “Here's my data, figure
out the best model!” You feed it your dataset, and it automatically
tries out a range of models, tunes them, and selects the one that
performs best.
I used an 80/20 train/validation split, meaning the model trains on
80% of the data and is validated on the remaining 20%. This is a
common approach, but you can adjust the ratio depending on your
needs; each option comes with trade-offs.
Once I get the validation results, I generate a report with
performance metrics. After that, I retrain the chosen model on the
full dataset (100%) to make the most of all the available data. So
far with both iterations of this project, Light Gradient Boosting
Machine (LightGBM) and Extra Trees Classifier consistently ranked as
the top performers.
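In code, that PyCaret workflow looks roughly like the sketch below. The file path and `features_df` name are placeholders for the labelled feature table, and the argument values reflect the description above rather than a definitive recipe (defaults vary between PyCaret versions).

```python
# Rough sketch of the PyCaret workflow described above.
# "features.csv" / features_df are placeholders for the labelled feature table.
import pandas as pd
from pycaret.classification import setup, compare_models, finalize_model, pull, save_model

features_df = pd.read_csv("features.csv")  # hypothetical path

setup(data=features_df, target="activity", train_size=0.8, session_id=42)

best = compare_models()              # trains and ranks a range of candidate models
leaderboard = pull()                 # metrics table used for the performance report
final_model = finalize_model(best)   # retrain the winner on 100% of the data
save_model(final_model, "purrbuddy_activity_model")
```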
Results & Considerations
Results
These results are based on the first 30 minutes of recorded
footage from the second iteration of this project. The model
currently recognises five activities that were well-represented
during this initial session:
- active_light
- still (and resting)
- eating
- ground_sniffing
- walking
Take a look at the short video below to see the model in action! For
just 30 minutes of data, I’m very happy with how well it’s already
performing. I can’t wait to start adding some of the more playful
activities back, such as jumping, running, and of course, those
all-important bum pats!
Footage of model running in real time (not live footage)
Model Metrics
Model | Accuracy | AUC | Recall | Precision | F1 | Kappa | MCC |
---|---|---|---|---|---|---|---|
Light Gradient Boosting Machine | 0.9454 | 0.9973 | 0.9454 | 0.9455 | 0.9449 | 0.9317 | 0.932 |
- Accuracy - The percentage of times the model correctly predicts the activity out of all predictions made. It shows overall performance but can be misleading if some activities occur much more often than others.
- AUC (Area Under the Curve) - Measures how well the model can distinguish one activity from the others.
- Recall - Of all the times the cat actually performed a certain activity, how many times did the model correctly identify it?
- Precision - Of all the times the model predicted a certain activity, how many were actually correct?
- F1 Score - A balanced score that combines precision and recall.
- Kappa - Shows how much better the model is at predicting activities compared to just guessing randomly.
- MCC (Matthews Correlation Coefficient) - A reliable measure that evaluates the model’s performance even if some activities happen much more often than others. It ranges from -1 (worst) to 1 (perfect).
Prediction Score & Accuracy by Class
Class | Avg Prediction Score | Avg Prediction Score (Correct) | Avg Prediction Score (Incorrect) | Accuracy |
---|---|---|---|---|
active_light | 0.9509 | 0.9693 | 0.8463 | 0.85 |
eating | 0.9833 | 0.9922 | 0.8269 | 0.9461 |
ground_sniffing | 0.9959 | 0.9963 | 0.8993 | 0.9961 |
still | 0.9917 | 0.9945 | 0.8131 | 0.9846 |
walking | 0.9886 | 0.9945 | 0.8775 | 0.95 |
- Avg Prediction Score - Average confidence in all predictions.
- Avg Prediction Score (Correct) - Average confidence in correct predictions.
- Avg Prediction Score (Incorrect) - Average confidence in incorrect predictions.
- Accuracy - Correct predictions ÷ total predictions.
Prediction Count by Class
Class | # active_light | # walking | # still | # eating | # ground_sniffing |
---|---|---|---|---|---|
active_light | 221 | 10 | 24 | 4 | 1 |
eating | 6 | 0 | 0 | 246 | 8 |
ground_sniffing | 0 | 0 | 0 | 1 | 259 |
still | 4 | 0 | 256 | 0 | 0 |
walking | 11 | 247 | 0 | 1 | 1 |
These numbers show how often the model predicted each activity for every actual activity. For example, when the true activity was active_light, the model: correctly predicted active_light 221 times, incorrectly predicted eating 4 times, and incorrectly predicted ground_sniffing 1 time, ...and so on.
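For reference, tables like these can be derived from the per-window validation predictions with a couple of pandas one-liners; the file path and column names below are assumptions about how the results are stored.

```python
# Sketch of deriving the per-class tables from validation predictions.
# The path and the 'actual', 'predicted', 'score' column names are assumptions.
import pandas as pd

results = pd.read_csv("validation_predictions.csv")  # hypothetical path

prediction_counts = pd.crosstab(results["actual"], results["predicted"])
accuracy_by_class = (results["actual"] == results["predicted"]).groupby(results["actual"]).mean()
avg_score_by_class = results.groupby("actual")["score"].mean()
```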
Flaws & Future Considerations
This project has been a massive learning experience, and several
areas have emerged where improvements could make a big difference.
Here are some of the key lessons and ideas for future development:
Hardware Consistency
One of the biggest lessons was the importance of a rigid and
consistent hardware assembly process. In the first iteration of
the project, I had to discard all my labeled data (it was quite a
lot) because the Arduino had become loose inside the housing.
After reassembling it, the model’s accuracy dropped significantly.
I realised the new orientation wasn’t identical to the original,
and even small changes in sensor alignment caused major
differences when running the model on the data collected prior. To
avoid this in the future, a precise, repeatable assembly method is
essential.
Collar Stability
The collar tends to shift during wear, often creeping up the side
of Archie’s neck. This can affect how the sensors interpret
motion. A future version of the collar should incorporate better
weight distribution and improved fastening methods to keep the
device in a consistent position throughout the day.
Model Architecture Exploration
So far, I’ve mainly used tree-based models like Extra Trees and
LightGBM. In the future, I’d like to explore 1D Convolutional
Neural Networks (CNNs), which are well-suited for time-series data
and may offer better performance without the need for feature
development.
Timestamp Accuracy
Currently, timestamps are assigned by the data collection script
on my computer, not directly by the collar. This introduces a
small delay between when the sensor data is captured and when it's
logged. Initially, I tried keeping time on the device itself, but
it wasn’t reliable and tended to drift. Going forward, adding a
dedicated Real-Time Clock (RTC) module will hopefully allow for
accurate, synchronised timestamps generated directly on the
device.
Video Labeling Challenges
Following Archie around with a camera is far from ideal. Not
only is it hard to get clear, consistent footage, but the presence
of a human (and a camera) often alters a cat’s behavior. A better
solution would be to set up a large cat enclosure with multiple
stationary cameras running continuously. This would create a
controlled environment for activity capture and eliminate the need
for constant manual recording.
Feature Selection and Optimisation
I haven’t spent much time analysing feature importance. This is
something I want to explore more deeply, as removing unhelpful or
redundant features could reduce noise and improve model
performance. It could also help simplify the system and make it
more efficient for real-time predictions on limited hardware.
Window Size and Step Size Experimentation
I've tested two window sizes so far: 0.5s and 1s, with a step size
of 25% of the window size. I'd like to do a more comprehensive sweep to
understand how different time window sizes affect classification
accuracy. Some behaviors are best captured in short bursts, while
others may benefit from a longer context window. Additionally,
observing how increasing or reducing step size affects the
accuracy would be insightful.
Tuning Minimum and Maximum Group Size Thresholds
As mentioned earlier, I discard data groups that fall outside
certain minimum and maximum group size thresholds during the
aggregation process. Relaxing the thresholds may introduce noisy
or inconsistent data, while tightening them not only reduces the
dataset, but also affects real-world performance as the model has
not “seen” data that is beyond these thresholds. Moving forward, I
would like to fine-tune this balance.
Resampling Strategy
I’ve used a combination of oversampling and undersampling to
balance the dataset, but I haven’t yet landed on the ideal target
number of samples per class. In future iterations, I plan to
experiment with different class sample counts to find the sweet
spot that avoids both underrepresentation and overfitting.
Hardware
Microcontroller: Arduino Nano 33 IoT
I chose this board mainly because it’s compact enough to fit on a
cat collar and includes the key modules I needed: Bluetooth, an
accelerometer, and a gyroscope. I also liked the idea that it has
an inbuilt Wi-Fi module, something I haven’t needed yet, but it
leaves the door open for future features.
Battery: Polymer Lithium Ion (LiPo) Battery – 3.7V 1100mAh
LiPo batteries are great for compact, high-energy applications
like this. 1100mAh was about as large as I could go without making
the collar bulky. Any smaller, and I wouldn’t get enough run time.
With this setup, I was able to achieve about 12–14 hours of
reliable operation.
Voltage Regulator: Pololu 5V Step-Up Voltage Regulator U1V10F5
The Arduino Nano 33 IoT requires 4.5V–21V on the VIN pin, so I
needed a step-up converter. I went with this specific regulator
because it was the smallest option I could find that reliably
outputs 5V, which is just above the minimum voltage required.
Case: Custom 3D-Printed Case (originally a Tic Tac box)
At first, I used a Tic Tac box to house the components, but it
turned out to be frustrating to assemble consistently. Eventually,
I bit the bullet, bought a 3D printer, and designed my own custom
case using Tinkercad. Tinkercad is beginner-friendly, intuitive,
and more than capable for what I need.
Collar: VELCRO® Brand ECO Roll – 2.5cm x 3m (Black)
This was a simple Bunnings pick-up. I originally used a
standard cat collar, but the buckles caused imbalance and made
the device shift around the cat’s neck. The Velcro strap is much
more stable and adjustable. However, a cat wearing Velcro around the
neck should not be left unsupervised, as it doesn't have a breakaway feature. Unlike cat-safe collars
that detach under mild force, this setup won’t release if the
cat gets caught on something.
Wrapping Up
So that’s everything for now! I hope you found this read either
interesting, insightful, or ideally, a bit of both. This has
definitely been one of the most rewarding (and at times, wildly
frustrating and borderline soul-crushing) personal projects I’ve
worked on. If you have any questions about the project, feel free to
reach out via the social links at the bottom. Thanks for stopping
by!