Smart-E Robot Arm Assistant: Helps you pick up cubes (WIP Post #1: Voice Interaction with Mozilla TTS and Vosk, LewanSoul 6DoF Arm python library, ArucoTags, OpenCV)

spent christmas weekend making this demo of a robot picking up cubes (it doesn’t know which color cube is where beforehand)

Wellll okay this could be a long post but I’m pretty sleep so I’ll just “liveblog” it and not worry too much about making it a tutorial / step-by-step instructions / details. (it’s a two day hackathon project anyway)

two out of three of my roommates are also roboticists, so we (the three of us) started on a house project: a robot arm.

one of us want to focus on designing the arm from scratch (started on cable driven, now focused on cycloidal gearboxes) , and two of us wanted to get started on the software side (divide and conquer). so we got a commercial robot arm for $200: the lewansoul 6dof xarm.

i decided over christmas that it was time to give up on work and do side projects. so i spent two days and hacked together a bare minimum demo of a “helper robot.” we eventually want something similar to dum-E robot in iron man, a robot that can hand you tools etc.

the minimum demo turned out to be verbally asking the robot to pick up a cube of a specific color

the name

i felt mean to call a defenseless robot “hey dummy” all the time, so went with suggestion to use “hey smarty” instead …

probably the dum-e reference will be lost on everyone, since I only remembered jarvis in iron man before starting this project

first print some cubes

T^T the inevitable thing that happens when you get a 3d printer, you crush your meche soul and print cubes. then cover in tape to get different colors

lewansoul robot arm python library

there are several python libraries written, we worked with

which implements already the IK (inverse kinematics) so that we can give x,y,z coordinates instead of joint angles for each of the six joints. runs smoothly (after fixing a udev permissions issue). the library’s calibration required us to move the servo horn a little from how we initially put the robot arm together.


$ git clone
$ cd nuro-arm
$ ipython
 from nuro_arm.robot.robot_arm import RobotArm
 import numpy as np

    robot = RobotArm(); 
    robot.passive_mode() ;   
    # now robot servos do not fight you. physically move robot arm to desired pose
    test_pose_xyz = np.round(robot.get_hand_pose()[0], decimals=4); 
    print('---> ', test_pose_xyz, ' <---')
    POS_MID = [0.200, -0.010]
    Z_SUPER_HI = 0.10
    OPEN = 0.5
    robot = RobotArm()
    xyz = np.append(POS_MID, Z_SUPER_HI)

additional things to try: provides GUIs

python -m nuro_arm.robot.record_movements
python -m nuro_arm.robot.move_arm_with_gui

NOTES if have difficulty seeing robot on ubuntu, re: udev, on 20.04

sudo vi /usr/lib/udev/rules.d/99-xarm.rules
SUBSYSTEM=="hidraw", ATTRS{product}=="LOBOT", GROUP="dialout", MODE="0666"
sudo usermod -a -G dialout $USER
sudo udevadm control --reload-rules && sudo udevadm trigger

pip install easyhid
sudo apt-get install libhidapi-hidraw0 libhidapi-libusb0 # maybe needed?

(on 18.04 it is /etc/udev/rules.d)


import easyhid
en = easyhid.Enumeration()
devices = en.find(vid=1155, pid=22352)
print([dev.description() for dev in devices])

# should something like
# devices ['HIDDevice:\n    /dev/hidraw2 | 483:5750 | MyUSB_HID | LOBOT | 496D626C3331\n    release_number: 513\n    usage_page: 26740\n    usage: 8293\n    interface_number: 0']

make sure robot is plugged into laptop

voice interaction

the other part of the demo is voice interaction. two parts:

speech to text: recognize the trigger phrase and the commands after. this is modelled after the amazon alexa, apple siri, google home style of voice interaction with “AI”, where you say a trigger phrase.

now based on our inspiration it would be “hey dum-e” but this felt pretty mean, so was suggested to use “hey smart-e” instead πŸ™‚

voice interaction part A: generating smart-E’s “voice”

from an earlier project (creating a welcome announcer in the lobby for the for ICRA 2020) I knew I could use mozilla TTS to generate nice-sounding voice from text.

(as opposed to the default “espeak” which sounds like horrible electronic garble)

was rather annoying install at the time (a year ago?) but seems to be a cleaner install this time around, however!! all the instructions are for training your own model, vs I just wanted to use it to generate audio.

So here’s a quickstart to using Mozilla TTS.

~$ python3 -m venv env && source ./env/bin/activate && which python && pip3 install --upgrade pip && pip3 install wheel && echo done

~$ pip install tts

~$ MODEL="tts_models/en/ljspeech/glow-tts"
~$ tts --text "What would you like me to do?" --model_name $MODEL
~$ tts --text "What would you like me to do?" --out_path "./whatwouldyoulikemetodo.wav"
 > tts_models/en/ljspeech/glow-tts is already downloaded.
 > Downloading model to /home/nrw/.local/share/tts/vocoder_models--en--ljspeech--multiband-melgan
 > Using model: glow_tts
 > Vocoder Model: multiband_melgan
 > Generator Model: multiband_melgan_generator
 > Discriminator Model: melgan_multiscale_discriminator
 > Text: What would you like me to do?
 > Text splitted to sentences.
['What would you like me to do?']
 > Processing time: 0.7528579235076904
 > Real-time factor: 0.3542274926029484
 > Saving output to whatwouldyoulikemetodo.wav
$ play tts_output.wav

voice interaction part B: giving smart-E “intelligence”

hot stars, in the last year or two we gained a realllllly nice speech2text library. used to be that if you wanted reasonable real-time-ish text transcription, you needed to get a google API key and use google API.

now you can just pip install the Vosk library, download a 50 mb file, and get real-time offline transcription that actually works! and if you want Chinese transcription instead of English, just download another 50 mb model file.

i’m liking this future πŸ™‚


$ git clone
$ cd vosk-api/python/example
$ wget
$ unzip
$ mv vosk-model-small-en-us-0.15 model
$ pip install sounddevice
$ python

now to make the keyword recognition happen.

def find_keyword(voice_result):
    keywords = ['marty', 'smarty', 'smart']

(In a while True loop -- see in the vosk repo)
    if rec.AcceptWaveform(data):
        sentence = rec.Result()
        if not state_triggered:
            found_keyword = find_keyword(sentence)
            if found_keyword:
                state_triggered = True
# if state_triggered, then move on to listening for colors

arucotags in python

how to locate the cubes? okay, actually i simplified this step in interest of finishing something in a weekend (before leaving for home for the holidays), and only had three positions. instead the issue was finding which color cube was where.

how to distinguish yellow cube from yellow stickynote? or black cube from checkerboard? i solved it by putting tags on the cubes and sampling (in 2d space) the color slightly above the tags to determine cube color.

I made a quickstart here for working with arucotags in python:

note that the gist shows how to get 3d out, vs for the hackdemo i only used the 2d.

what color is the cube

this was actually annoying. in the end i assume there are only three colors and do a hardcoded HSV filter. Specifically, use hue to separate out the yellow, and then value to separate black from blue. (this is hack – note that the logic is flawed, in that black can be any hue)

crop = image[y-rad:y+rad, x-rad:x+rad]
color_sample = crop.mean(axis=0).mean(axis=0)
rgb = color_sample[::-1] # reverse, color is in BGR from opencv
rgb_pct = rgb / 255
hsv_pct = colorsys.rgb_to_hsv(*rgb_pct) # fxn takes 3 values, so unpack list with *
hsv = np.array(hsv_pct) * 255
print('rgb', rgb, 'hsv', hsv)

hue = hsv[0]
val = hsv[2]

closest_color = None

if 10 < hue < 40:
    closest_color = 'yellow'
if hue >= 50:
    if val > 90:
        closest_color = 'blue'
        closest_color = 'black'

additional notes

i’ll think about making a public snapshot of the git repo for the demo, perhaps. for the above code samples, my file directory looks like so:

$ ls
speech_output # wav files from mozilla tts
$ python


The pop culture reference for smart-E

this is the voice interaction, 38s sec to 62 sec

for longer context of the dum-e scenes, see

creepy voice output

it’s a deep learning model, so it gives unpredictably different results for hey! vs hey!! vs hey!!! and seems to do poorly on short phrases in general. (also you can’t force any specific intonations, like a cheerful “hi there!” vs a sullen “hi there”)

here’s the creepy output of the default model with just “hey”

funny bugs

i got the voice recognition to recognize the colors, but it seemed like the program would get caught in a loop where it would reparse the same audio again and again. i spent a while going “??? how do python queues work, do they not remove items with get()???” and clearing queues and such. eventually my roommie pointed out that, i have it give the response “okay picking up the yellow cube” so it was literally reacting to its own voice in an infinite loop 0:

it was fun to think about the intricacies of all the voice assistants we have nowadays

next steps

on the meche side, build an actual arm that can e.g. lift a drill (or at least a drill battery)

on the cs side, we will continue working with mini arm. bought a set of toy miniature tools (used for kids). we’ll use probably reinforcement learning to learn 1. what pixels correspond to a tool 2. which tool is it 3. where to grasp on the tool. and then 4. how to add new tools to workspace

decisions: match cad models? or pixels? to determine which tool

on the voice side we also have to do some thought about how people will interact with it. “hand me the quarter-inch allen wrench” or “hex wrench” or “hex key” or “t-handle hex” or “ball-end hex” etc., asking for clarification, …

currently we just figured out how to go from webcam image (x,y pixels) to robot frame (x,y,z millimeters) so you can click on the camera image and the robot will move to that point in real life

Pandemic Diary #73 – travel, omicron, classes (2 Jan 2022)

it’s 2022! here’s hoping it’ll be a great year compared to the bletch of 2020 (oh god has it been nearly two years already?!) and the meh of 2021

jeez haven’t diaried in a while

am in GA, flew back. only half-heartedly attempted to quarantine around parents. i did get a proper n95 mask though, and (mouthwash? according to parents? i guess omicron is more upper respiratory but ???), stripped off a layer of clothes, and hot shower when i got back. kinda wore a mask the first two days but still had meals together so,

i guess since we are all roughly at max strength for our boosters (2-3 weeks ago), and omicron is the most transmissable but possibly less deadly, so i’ve been laissez-faire. i guess if you want to only spend a week at home, then it’s not really compatible with actual quarantine. my parents both work remotely, so the risk of starting a community cluster is relatively low.

took a quickvue test ~ day 3 (~84 hrs in). at-home antigen test, analog readout – you read the stripes, if just blue shows up you are tentatively negative, if blue and pink show up you almost certainly have covid. got a negative result. will test again pre-flight tuesday.

at-home tests are hard to find online but still pretty available in-stores. appointments for professional tests are one-week out scheduled.

got guilted/dragged into indoor lunch… -___-;; sketchy, GA vax rate is ~50% (MA is ~75%, malta was ~90%, for two shots). mom did not go b/c immunocompromised, last last new years she got pneumonia and sepsis and a stay in the ICU from a holiday dinner. place was spaciously fairly empty at least. 2-3 empty tables between occupied ones. stressful. i realized my main concern actually now with catching covid, post-vaccination, is having to wait two extra weeks to get better before i can catch a flight back to boston. i can’t keep slipping two weeks of progress everywhere, and definitely unproductive in a stressful way at home.

parents have made changes: no longer shower after get back from stores. and no longer need to decontaminate groceries.

risk levels are changing. i will take a class in spring, have to teach two classes in that case. unlikely it’ll be remote. tested weekly. but still, classes with maybe a hundred+ students… (? i think? no idea how it works since haven’t been to class since 2019). very different environment than, as senior grad student, going in to lab and interacting with at most the same 10 people. don’t feel great about it. i think i’m going to really hate the first quarter of 2022. need to figure out something to look forward to. maybe life will surprise me.

ankle is doing better. maybe there was a mini-fracture and it healed, who knows, doesn’t matter. still achy-painful 6.5 weeks in to walk up and down the stairs more than a few times, but not enough to stop me from going downstairs and snacking lol.

(went home for new years, landing at the wrong end of terminal A & having to walk 26 gates felt like a marathon, stopped multiple times)

more new cases in a week in the US than at any time during the pandemic. ICUs on divert status. a little tired of the added friction to see people, to consider people’s circles, masks, check if people are okay with one thing or the other, just tired and want to not care about the future, and savings goals, and career ladders, and life trajectories, and how people perceive me, and whether i’m spending my time in an optimized way

well, also it’s late so i’m tired

happy 2022

As of 3 Feb 2022
As of 3 Feb 2022