Introduction
Hi there and thanks for stopping by to check out Purrbuddy! My
name's Isaac and I'm a software engineer with a strong interest in
design and innovation. This project was a way
for me to dive deeper into machine learning through a hands-on
experiment, one that also happens to involve Archie the cat as an
unexpectedly cooperative (sometimes) research partner.
Archie is my partner’s cat, though he goes by many names: Archibald,
Baldy, Archibaldy Bald... you get the idea. The core idea behind
this project was to build a smart cat collar that uses machine
learning to classify a cat’s activity in real time. With that data,
we could log his behavior throughout the day and generate insights
like how long he spends eating, drinking, running, sleeping, or just
loafing around. Over time, this data could even be used to spot
trends or irregularities, potentially supporting early health or
behavior intervention.
The collar is equipped with a microcontroller that has an inbuilt
accelerometer and gyroscope, and it streams data from these
sensors to my computer while Archie goes about his day. Meanwhile, I
record video footage of him which allows me to manually label each
timeframe with the corresponding activity (e.g.,
“2025-10-10T10:30:21.356 - 2025-10-10T10:30:24.555 — EATING”). I
then use a script to align these labels with the sensor data,
producing a labeled dataset. From there, I process the data,
engineer features, and train a machine learning model that can
predict Archie’s activity in real time.
Although I didn’t follow it exactly, I’d like to acknowledge the
authors of this paper: “The Use of Triaxial Accelerometers and
Machine Learning Algorithms for Behavioural Identification in
Domestic Cats (Felis catus): A Validation Study” (link
here). Their methodology played a key role in shaping my approach, and
their work provided a solid foundation that helped guide my
direction with confidence.
Important
For context, it's worth mentioning that this project is ongoing
and has gone through two main iterations so far. The first was my
initial attempt, which unfortunately led to a lot of discarded
data (you can read more about that under
Flaws and Future Considerations – Hardware Consistency). The current phase is a fresh start, building on the lessons
from that first run and restarting the data collection process
with a more refined approach.
Labeling
Establishing Labels
Before training a model, I had to answer an important question:
Which activities are actually useful for someone to know about their
cat?
After reading various online resources on feline behavior, doing
some ChatGPT brainstorming, and experimenting with predictive
accuracy during the first iteration of this project, I have
currently narrowed it down to the following activity labels:
- grooming
- itching (scratching self)
- ground_sniffing
- kneading
- littering
- shaking
- scratching (e.g. scratching a post)
- running
- walking
- trotting
- bum_pats
- eating_drinking
- deceased
- jump_up
- jump_down
- still
- unsure
There are certainly more activities I could have included, but these
represent the most common and distinguishable activities in Archie’s
typical environment. The goal was to strike a balance between
ambition and feasibility.
Grouping Rare or Unclear Behaviors
These labels are admittedly optimistic as I’m still unsure how much
time I’ll ultimately spend collecting and labeling data to achieve a
wide range of activities with a solid class balance. Throughout both
iterations of the project, I’ve been using a common strategy:
grouping rare or difficult-to-label behaviors under broader labels
until there’s enough data to split them out properly. These broader
labels are:
- active_light – for slower, more predictable movements.
- active_chaotic – for fast, unpredictable bursts of activity.
While not ideal, these broader labels were necessary both to keep
the project moving and to preserve my sanity. Cats do a lot of
strange, hard-to-classify things. So when I can't confidently assign
a specific label, or when there just isn’t enough data for an
activity to stand on its own, it goes into one of these broader
buckets.
That said, I’ve noticed that some distinct activities (like jumping)
still get classified quite accurately, even when there’s not a lot
of data for them. So I’ve been keeping those as their usual separate
labels rather than lumping them into broader categories. Of course,
this might change as I collect more data, but it’s what I’ve found
to work so far.
Label Justifications & Additional Commentary
Some labels in the set deserve a bit of extra explanation, not from
a medical perspective (I’m not a vet and don’t want to spread
misinformation), but from a practical point of view:
- itching – This is often a sign that the collar has shifted up the neck and may need to be repositioned. Including this can help detect orientation issues in addition to behavior.
- shaking – Similar to itching, shaking is another good indicator that the collar has moved or is bothering the cat.
- ground_sniffing – I included this mainly to help the model distinguish between sniffing and eating. I considered permanently grouping it under active_light, but since the model could predict it with decent accuracy, I kept it as its own label.
- bum_pats – I was curious to see how well the model could detect human interaction. Tracking this behavior could be useful down the line, or just a fun stat that could encourage positive and more frequent human interaction.
- eating_drinking – Eating and drinking were combined into one label since, from motion data alone, they’re nearly impossible to tell apart. With a large enough dataset in the future, I might revisit this and split them.
- deceased – No, Archie is fine. I simulated this by placing the collar on a desk, completely still, in various orientations. It was easy to collect and could serve as a way to know the collar has become detached.
- jump_up / jump_down – These are common cat behaviors that I wanted to track, but I also included them because I thought it could be fun to later trigger a Super Mario-style jump sound every time he leaps.
- unsure – This label is used when Archie is out of the camera’s view or when the collar shifts too far around his neck to get reliable data. Any data labeled “unsure” is excluded from training.
Lastly, there’s a “resting” output. This is not a real label, but a
post-processing state. When the model detects Archie has been still
for an extended period, I translate “still” into “resting”.
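To make that concrete, here is a minimal sketch of the post-processing step in Python. The threshold and names are illustrative assumptions, not the exact values used in the project.

```python
# Minimal sketch of the "still" -> "resting" post-processing step.
# The threshold and names are illustrative, not the project's exact values.
from datetime import timedelta

RESTING_THRESHOLD = timedelta(minutes=5)  # assumed duration of stillness before "resting"

def apply_resting_state(predictions):
    """predictions: time-ordered list of (timestamp, label) tuples."""
    output = []
    still_since = None
    for timestamp, label in predictions:
        if label == "still":
            if still_since is None:
                still_since = timestamp
            # Once the run of "still" exceeds the threshold, report "resting".
            if timestamp - still_since >= RESTING_THRESHOLD:
                label = "resting"
        else:
            still_since = None
        output.append((timestamp, label))
    return output
```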
Labeling with Overlapping Sliding Window Technique
In the first iteration of this project, I labeled the activity data
in a very strict, idealised way. For example, if Archie was eating,
I would only label the parts where he was clearly eating and any
ambiguous or borderline moments were discarded. My thinking at the
time was to train the model only on clean, unmistakable examples of
each activity, and let it "guess" the rest. However, I later
realised I was limiting the model’s potential, especially when it
came to transitions between activities.
An overlapping sliding window, in this context, is a method where you
break a continuous stream of data into overlapping windows (e.g.,
1-second windows). You then extract statistical features (like min,
max, skew, etc.) from each window and use those as inputs for your
model. This allows you to capture the shape and transition of
movement over time, rather than just relying on isolated data
points.
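As a rough illustration (not the project's exact script), here is how overlapping windows can be cut from a timestamped sensor stream with pandas. The 1-second window, 0.25-second step, and column name are assumptions for the example.

```python
# Rough sketch of overlapping window segmentation over a timestamped stream.
# Window/step lengths and the 'timestamp' column name are illustrative.
import pandas as pd

WINDOW = pd.Timedelta(seconds=1.0)   # window length
STEP = pd.Timedelta(seconds=0.25)    # step of 25% of the window -> 75% overlap

def iter_windows(df):
    """Yield successive overlapping slices of a DataFrame sorted by 'timestamp'."""
    start = df["timestamp"].iloc[0]
    end = df["timestamp"].iloc[-1]
    while start + WINDOW <= end:
        window = df[(df["timestamp"] >= start) & (df["timestamp"] < start + WINDOW)]
        if not window.empty:
            yield window
        start += STEP
```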
The problem was that with gaps in my original labeled data, the
sliding window couldn’t be applied consecutively. By excluding
ambiguous moments, I was creating holes in the dataset that made
this technique ineffective. So I changed my approach in the second
iteration and started labeling activities more continuously, even
during transitions. This let me take full advantage of the sliding
window method, giving the model more training data and a better
understanding of activity transitions.
So far, this technique has delivered noticeably better performance
compared to the method I used in the first iteration, even with a
similar amount of data.
Feature Development
Data Collection
The data collection process is fairly straightforward. I start with
some basic calibration, strap the collar onto Archie, and let it
stream accelerometer and gyroscope data to my computer while I film
him. To allow me to align the video footage with the sensor data, I
use a camera app called “Timestamp Camera” that overlays a
millisecond-precision timestamp in the corner of the footage.
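Once the footage has been labeled, aligning it with the sensor data amounts to joining labeled time ranges onto sensor rows by timestamp. Here is a simplified sketch of that idea; the column names and schema are assumptions for illustration, not my actual script.

```python
# Simplified sketch of aligning labelled time ranges with sensor rows.
# Column names ('timestamp', 'start', 'end', 'activity') are assumptions.
import pandas as pd

def label_sensor_data(sensor_df, labels_df):
    """sensor_df: sensor readings with a datetime 'timestamp' column.
    labels_df: one row per labelled interval, e.g.
    2025-10-10T10:30:21.356 .. 2025-10-10T10:30:24.555 -> EATING."""
    sensor_df = sensor_df.copy()
    sensor_df["activity"] = None
    for interval in labels_df.itertuples():
        mask = (sensor_df["timestamp"] >= interval.start) & (sensor_df["timestamp"] <= interval.end)
        sensor_df.loc[mask, "activity"] = interval.activity
    # Unlabelled rows and anything marked "unsure" are excluded from training.
    return sensor_df[sensor_df["activity"].notna() & (sensor_df["activity"] != "unsure")]
```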
Feature Development
While some machine learning models, particularly deep learning
architectures, can learn directly from raw time-series data,
traditional models like decision trees, random forests, or Extra
Trees typically perform better when given structured,
hand-engineered features.
Since I'm using traditional models in this project, I apply feature
development: I break the continuous sensor stream into short time
windows and summarise each window down to a set of features.
This turns raw accelerometer and gyroscope readings into a
structured format the model can easily learn from.
For example, let’s say we collect 100 consecutive readings from the
accelerometer and gyroscope. Rather than feed all 100 rows into the
model, we aggregate that data into a single row of features that
capture the movement patterns during that window. One simple feature
might be the average acceleration on the X-axis. When there is more
than one distinct activity within the window, the window is labeled
with the most predominant activity. There are various ways a label
could be chosen in this situation, but I went with this one because
it was simple to implement and fundamentally makes sense.
Here’s a sample of the kinds of features I extracted, many of which
are commonly used in activity recognition tasks:
- Basic statistics for each axis of acceleration and gyroscope data (X, Y, Z) - Mean, minimum, maximum, standard deviation, skewness, kurtosis.
- Correlation between axes - e.g. correlation between Ax and Ay, Gx and Gz, etc.
- Energy and movement-based features - SMA (Signal Magnitude Area), ODBA / VDBA (Overall/Vectorial Dynamic Body Acceleration) and AVM (Average Vector Magnitude).
- Tilt-based orientation features - Mean tilt in X and Y, Gyroscope tilt, Complementary tilt.
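The sketch below shows how one window of readings could be reduced to a single feature row, covering a few of the features listed above and labeling the window by its predominant activity. The axis column names are assumptions, and only a subset of the features is shown.

```python
# Sketch of reducing one window of readings to a single feature row.
# Axis column names are assumptions; only a subset of features is shown.
from scipy.stats import skew, kurtosis

AXES = ["ax", "ay", "az", "gx", "gy", "gz"]  # accelerometer + gyroscope axes

def extract_features(window):
    """window: DataFrame of consecutive readings with per-axis columns
    and a per-row 'activity' column."""
    features = {}
    for axis in AXES:
        values = window[axis]
        features[f"{axis}_mean"] = values.mean()
        features[f"{axis}_min"] = values.min()
        features[f"{axis}_max"] = values.max()
        features[f"{axis}_std"] = values.std()
        features[f"{axis}_skew"] = skew(values)
        features[f"{axis}_kurtosis"] = kurtosis(values)
    # One example cross-axis correlation and a simple Signal Magnitude Area.
    features["corr_ax_ay"] = window["ax"].corr(window["ay"])
    features["sma"] = window[["ax", "ay", "az"]].abs().sum(axis=1).mean()
    # Window label: the most predominant activity within the window.
    features["activity"] = window["activity"].mode().iloc[0]
    return features
```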
Choosing a Window Size
I experimented with two different window sizes: 1-second and
0.5-second windows. Here’s what I found:
- Shorter windows (0.5s) - Better at capturing quick, high-intensity actions like jumping or shaking. These activities are over in a flash, so smaller windows help the model "see" them more clearly.
- Longer windows (1s) - More effective for detecting slower, continuous activities like eating or walking, where more context improves prediction accuracy and reduces misclassifications.
Handling Variable Sensor Read Rates
The Arduino streams data from the IMU at around 104 Hz; that's
roughly 104 readings per second. But in practice, the number of
readings can vary depending on factors like battery level, hardware
imperfections, distance from the receiver, and wireless
interference.
For example, the data collection script might receive 50 readings in
one second and 140 readings in the next. My thinking was that this
variability could corrupt the consistency of feature windows if you
split the data purely by row count, or by time windows without
accounting for how many readings each window contains.
To solve this, I use a time-based sliding window technique with
group size consideration. That means each window includes all
readings that fall within a specific time span, regardless of how
many rows that turns out to be. Within each window, I then calculate
and store the group size (the number of rows in the window) as an
additional feature. Not only can this be used as a feature, it also
acts as a quality check. If the group size falls outside a safe threshold
(e.g., too few or too many readings), I discard that group from
training.
When running the model in real time, if a group size is out of
bounds, I flag the prediction as “possibly inaccurate”. This way, I
preserve the integrity of my training data and help ensure the model
isn’t learning from windows with limited or noisy input.
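Here is a minimal sketch of that group-size check; the thresholds are made-up values for a roughly 104 Hz stream and 1-second windows, not the project's actual bounds.

```python
# Minimal sketch of the group-size quality check (thresholds are illustrative).
MIN_GROUP_SIZE = 70    # assumed lower bound for ~104 Hz over a 1 s window
MAX_GROUP_SIZE = 140   # assumed upper bound

def group_size_ok(window):
    """Return (group_size, ok) for one time window of readings."""
    group_size = len(window)
    return group_size, MIN_GROUP_SIZE <= group_size <= MAX_GROUP_SIZE

# Training: windows where ok is False are discarded.
# Real time: the prediction is kept but flagged as "possibly inaccurate".
```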
Model Development
Undersampling & Oversampling
To train a well-balanced model, it’s important to avoid having too
much of one activity and too little of another; this imbalance can
bias the model toward predicting the more common classes. Ideally,
each class should have roughly the same amount of training data. To
get closer to this goal, I use a combination of undersampling and
oversampling.
- Undersampling - This involves reducing the amount of data from overrepresented classes. In my case, I used a simple random undersampling approach which randomly removes entries until the class size is closer to the target. There are more sophisticated techniques available, but this worked well enough for my dataset.
- Oversampling - This is where we increase the number of samples for underrepresented classes. I used SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic examples rather than simply duplicating existing ones. I did try basic duplication early on, but it often made the model overfit and behave unpredictably in practice.
As for deciding how much data each class should have, it depends on
the total amount available, but a rough rule of thumb I follow is to
calculate the mean number of samples per class and aim for that as a
baseline target across all classes.
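As a sketch of that balancing step (not necessarily my exact script), the imbalanced-learn library provides both a random undersampler and SMOTE, and both accept per-class target counts, so the mean-per-class rule of thumb can be applied roughly like this:

```python
# Sketch of balancing classes toward the mean class count with imbalanced-learn.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

def balance_classes(X, y):
    counts = Counter(y)
    target = int(sum(counts.values()) / len(counts))  # mean samples per class

    # Step 1: randomly undersample classes that sit above the target.
    under_targets = {label: min(count, target) for label, count in counts.items()}
    X, y = RandomUnderSampler(sampling_strategy=under_targets).fit_resample(X, y)

    # Step 2: synthesise new samples (SMOTE) for classes below the target.
    # Note: SMOTE needs a handful of real samples per class to interpolate from.
    counts = Counter(y)
    over_targets = {label: max(count, target) for label, count in counts.items()}
    X, y = SMOTE(sampling_strategy=over_targets).fit_resample(X, y)
    return X, y
```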
Building the Model
To build the model, I used PyCaret, a beginner-friendly Python
machine learning library that made the process really
straightforward. It basically let me say, “Here's my data, figure
out the best model!” You feed it your dataset, and it automatically
tries out a range of models, tunes them, and selects the one that
performs best.
I used an 80/20 train/validation split, meaning the model trains on
80% of the data and is validated on the remaining 20%. This is a
common approach, but you can adjust the ratio depending on your
needs; each option comes with trade-offs.
Once I get the validation results, I generate a report with
performance metrics. After that, I retrain the chosen model on the
full dataset (100%) to make the most of all the available data. So
far with both iterations of this project, Light Gradient Boosting
Machine (LightGBM) and Extra Trees Classifier consistently ranked as
the top performers.
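In code, that PyCaret workflow looks roughly like the sketch below. The file path and `features_df` name are placeholders for the labelled feature table, and the argument values reflect the description above rather than a definitive recipe (defaults vary between PyCaret versions).

```python
# Rough sketch of the PyCaret workflow described above.
# "features.csv" / features_df are placeholders for the labelled feature table.
import pandas as pd
from pycaret.classification import setup, compare_models, finalize_model, pull, save_model

features_df = pd.read_csv("features.csv")  # hypothetical path

setup(data=features_df, target="activity", train_size=0.8, session_id=42)

best = compare_models()              # trains and ranks a range of candidate models
leaderboard = pull()                 # metrics table used for the performance report
final_model = finalize_model(best)   # retrain the winner on 100% of the data
save_model(final_model, "purrbuddy_activity_model")
```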
Results & Considerations
Results
These results are based on the first 30 minutes of recorded
footage from the second iteration of this project. The model
currently recognises five activities that were well-represented
during this initial session:
- active_light
- still (and resting)
- eating
- ground_sniffing
- walking
Take a look at the short video below to see the model in action! For
just 30 minutes of data, I’m very happy with how well it’s already
performing. I can’t wait to start adding some of the more playful
activities back, such as jumping, running, and of course, those
all-important bum pats!
Footage of model running in real time (not live footage)
Model Metrics
Model | Accuracy | AUC | Recall | Precision | F1 | Kappa | MCC |
---|---|---|---|---|---|---|---|
Light Gradient Boosting Machine | 0.9454 | 0.9973 | 0.9454 | 0.9455 | 0.9449 | 0.9317 | 0.932 |
- Accuracy - The percentage of times the model correctly predicts the activity out of all predictions made. It shows overall performance but can be misleading if some activities occur much more often than others.
- AUC (Area Under the Curve) - Measures how well the model can distinguish one activity from the others.
- Recall - Of all the times the cat actually performed a certain activity, how many times did the model correctly identify it?
- Precision - Of all the times the model predicted a certain activity, how many were actually correct?
- F1 Score - A balanced score that combines precision and recall.
- Kappa - Shows how much better the model is at predicting activities compared to just guessing randomly.
- MCC (Matthews Correlation Coefficient) - A reliable measure that evaluates the model’s performance even if some activities happen much more often than others. It ranges from -1 (worst) to 1 (perfect).
Prediction Score & Accuracy by Class
Class | Avg Prediction Score | Avg Prediction Score (Correct) | Avg Prediction Score (Incorrect) | Accuracy |
---|---|---|---|---|
active_light | 0.9509 | 0.9693 | 0.8463 | 0.85 |
eating | 0.9833 | 0.9922 | 0.8269 | 0.9461 |
ground_sniffing | 0.9959 | 0.9963 | 0.8993 | 0.9961 |
still | 0.9917 | 0.9945 | 0.8131 | 0.9846 |
walking | 0.9886 | 0.9945 | 0.8775 | 0.95 |
- Avg Prediction Score - Average confidence in all predictions.
- Avg Prediction Score (Correct) - Average confidence in correct predictions.
- Avg Prediction Score (Incorrect) - Average confidence in incorrect predictions.
- Accuracy - Correct predictions ÷ total predictions.
Prediction Count by Class
Class | # active_light | # walking | # still | # eating | # ground_sniffing |
---|---|---|---|---|---|
active_light | 221 | 10 | 24 | 4 | 1 |
eating | 6 | 0 | 0 | 246 | 8 |
ground_sniffing | 0 | 0 | 0 | 1 | 259 |
still | 4 | 0 | 256 | 0 | 0 |
walking | 11 | 247 | 0 | 1 | 1 |
These numbers show how often the model predicted each activity for every actual activity. For example, when the true activity was active_light, the model: correctly predicted active_light 221 times, incorrectly predicted eating 4 times, and incorrectly predicted ground_sniffing 1 time, ...and so on.
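For reference, tables like these can be derived from the per-window validation predictions with a couple of pandas one-liners; the file path and column names below are assumptions about how the results are stored.

```python
# Sketch of deriving the per-class tables from validation predictions.
# The path and the 'actual', 'predicted', 'score' column names are assumptions.
import pandas as pd

results = pd.read_csv("validation_predictions.csv")  # hypothetical path

prediction_counts = pd.crosstab(results["actual"], results["predicted"])
accuracy_by_class = (results["actual"] == results["predicted"]).groupby(results["actual"]).mean()
avg_score_by_class = results.groupby("actual")["score"].mean()
```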
Flaws & Future Considerations
This project has been a massive learning experience, and several
areas have emerged where improvements could make a big difference.
Here are some of the key lessons and ideas for future development:
Hardware Consistency
One of the biggest lessons was the importance of a rigid and
consistent hardware assembly process. In the first iteration of
the project, I had to discard all my labeled data (it was quite a
lot) because the Arduino had become loose inside the housing.
After reassembling it, the model’s accuracy dropped significantly.
I realised the new orientation wasn’t identical to the original,
and even small changes in sensor alignment caused major
differences when running the model on the data collected prior. To
avoid this in the future, a precise, repeatable assembly method is
essential.
Collar Stability
The collar tends to shift during wear, often creeping up the side
of Archie’s neck. This can affect how the sensors interpret
motion. A future version of the collar should incorporate better
weight distribution and improved fastening methods to keep the
device in a consistent position throughout the day.
Model Architecture Exploration
So far, I’ve mainly used tree-based models like Extra Trees and
LightGBM. In the future, I’d like to explore 1D Convolutional
Neural Networks (CNNs), which are well-suited for time-series data
and may offer better performance without the need for feature
development.
Timestamp Accuracy
Currently, timestamps are assigned by the data collection script
on my computer, not directly by the collar. This introduces a
small delay between when the sensor data is captured and when it's
logged. Initially, I tried keeping time on the device itself, but
it wasn’t reliable and tended to drift. Going forward, adding a
dedicated Real-Time Clock (RTC) module will hopefully allow for
accurate, synchronised timestamps generated directly on the
device.
Video Labeling Challenges
Following Archie around with a camera is far from ideal. Not
only is it hard to get clear, consistent footage, but the presence
of a human (and a camera) often alters a cat’s behavior. A better
solution would be to set up a large cat enclosure with multiple
stationary cameras running continuously. This would create a
controlled environment for activity capture and eliminate the need
for constant manual recording.
Feature Selection and Optimisation
I haven’t spent much time analysing feature importance. This is
something I want to explore more deeply, as removing unhelpful or
redundant features could reduce noise and improve model
performance. It could also help simplify the system and make it
more efficient for real-time predictions on limited hardware.
Window Size and Step Size Experimentation
I've tested two window sizes so far: 0.5s and 1s, with a step size
of 25% of the window size. I'd like to do a more comprehensive sweep to
understand how different time window sizes affect classification
accuracy. Some behaviors are best captured in short bursts, while
others may benefit from a longer context window. Additionally,
observing how increasing or reducing step size affects the
accuracy would be insightful.
Tuning Minimum and Maximum Group Size Thresholds
As mentioned earlier, I discard data groups that fall outside
certain minimum and maximum group size thresholds during the
aggregation process. Relaxing the thresholds may introduce noisy
or inconsistent data, while tightening them not only reduces the
dataset, but also affects real-world performance as the model has
not “seen” data that is beyond these thresholds. Moving forward, I
would like to fine-tune this balance.
Resampling Strategy
I’ve used a combination of oversampling and undersampling to
balance the dataset, but I haven’t yet landed on the ideal target
number of samples per class. In future iterations, I plan to
experiment with different class sample counts to find the sweet
spot that avoids both underrepresentation and overfitting.
Hardware
Microcontroller: Arduino Nano 33 IoT
I chose this board mainly because it’s compact enough to fit on a
cat collar and includes the key modules I needed: Bluetooth, an
accelerometer, and a gyroscope. I also liked the idea that it has
an inbuilt Wi-Fi module, something I haven’t needed yet, but it
leaves the door open for future features.
Battery: Polymer Lithium Ion (LiPo) Battery – 3.7V 1100mAh
LiPo batteries are great for compact, high-energy applications
like this. 1100mAh was about as large as I could go without making
the collar bulky. Any smaller, and I wouldn’t get enough run time.
With this setup, I was able to achieve about 12–14 hours of
reliable operation.
Voltage Regulator: Pololu 5V Step-Up Voltage Regulator U1V10F5
The Arduino Nano 33 IoT requires 4.5V–21V on the VIN pin, so I
needed a step-up converter. I went with this specific regulator
because it was the smallest option I could find that reliably
outputs 5V, which is just above the minimum voltage required.
Case: Custom 3D-Printed Case (originally a Tic Tac box)
At first, I used a Tic Tac box to house the components, but it
turned out to be frustrating to assemble consistently. Eventually,
I bit the bullet, bought a 3D printer, and designed my own custom
case using Tinkercad. Tinkercad is beginner-friendly, intuitive,
and more than capable for what I need.
Collar: VELCRO® Brand ECO Roll – 2.5cm x 3m (Black)
This was a simple Bunnings pick-up. I originally used a
standard cat collar, but the buckles caused imbalance and made
the device shift around the cat’s neck. The Velcro strap is much
more stable and adjustable. However, a cat wearing Velcro around the
neck should not be left unsupervised, as it doesn't have a breakaway feature. Unlike cat-safe collars
that detach under mild force, this setup won’t release if the
cat gets caught on something.
Wrapping Up
So that’s everything for now! I hope you found this read either
interesting, insightful, or ideally, a bit of both. This has
definitely been one of the most rewarding (and at times, wildly
frustrating and borderline soul-crushing) personal projects I’ve
worked on. If you have any questions about the project, feel free to
reach out via the social links at the bottom. Thanks for stopping
by!