Better asserts in C with link-time optimization

I've been a fan of link-time optimization for several years.  I've been a fan of efficient programming for even longer.  I was an early fan of C++ because features like function overloading made it easier to move decisions done at run-time in C to compile-time with C++.  As C++ has become more complex over the decades, I've become less of a C++ fan, and appreciate the simplicity of C.

For small embedded systems like 8-bit AVRs and ARM M0, run-time error checking with assert() has minimal usefulness compared to UNIX, where a core dump will help pinpoint the error location and the state of the program at the time of the error.  Even if the usability problems were solved, real-time embedded systems may not be able to afford the performance costs of run-time error checking.

Both C++ and C support static assertions.   Anyone who has tried to use static_assert likely has encountered "expression in static assertion is not constant" errors for anything but the simplest of checks.  The limitations of static_assert are well documented elsewhere, so I will not go into further details in this post.
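
To see the limitation, consider a range check on a function argument.  Even when every caller passes a constant, static_assert can't see it, because a function parameter is not an integer constant expression (a minimal example; the commented line is what fails):

#include <assert.h>

void set_pll_mult(unsigned multiplier)
{
    // error: expression in static assertion is not constant
    //static_assert(multiplier <= 8, "multiplier out of range");
}

int main()
{
    set_pll_mult(9);
}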

I had long understood that LTO allowed the compiler to evaluate expressions in code at build time, but I never realized its potential for static error checking.  The idea came to me when looking at a fellow embedded developer's code for fast Arduino digital IO.  In particular, Bill's code introduced me to the gcc error function attribute.  The documentation describes the attribute as follows:

  • If the error or warning attribute is used on a function declaration and a call to such a function is not eliminated through dead code elimination or other optimizations, an error or warning (respectively) that includes message is diagnosed.  This is useful for compile-time checking ...
Although the error attribute appears to have been introduced to address some of the limitations of static asserts, it doesn't seem to be commonly used.  After some experimentation, I came up with a basic example.
pll.c:
__attribute((error("")))
void constraint_error(char * details);

volatile unsigned pll_mult;


void set_pll_mult(unsigned multiplier)
{
    if (multiplier > 8) constraint_error("multiplier out of range");
    pll_mult = multiplier;
}

main.c:
extern void set_pll_mult(unsigned multiplier);

int main()
{
    set_pll_mult(9);
}

$ gcc -Os -flto -o main *.c
In function 'set_pll_mult.constprop',
    inlined from 'main' at main.c:6:5:
pll.c:9:25: error: call to 'constraint_error' declared with attribute error:
     if (multiplier > 8) constraint_error("multiplier out of range");
                         ^
When set_pll_mult() is called with an argument greater than 8, a compile error occurs.  When it is compiled with a valid multiplier, the "if (multiplier > 8)" statement is eliminated by the optimizer.  One drawback to the technique is that the caller (main.c in this case) is not identified when the called function is not inlined.  Increasing the optimization level to O3 may help to get the function inlined.


Measuring AVR interrupt latency

One thing I like about AVR MCUs is that their datasheets are relatively short and simple.  It's also one of the things I don't like, because the datasheets often lack important details.  External interrupt latency is one area where the details are incomplete and unclear.  I decided to investigate the interrupt latency of the ATtiny13 and the ATtiny85.  The datasheet's description of interrupt response time and external interrupts is identical for both parts.

Interrupt Response Time

The ATtiny13 datasheet section 4.7.1, under the heading "Interrupt Response Time", says, "The interrupt execution response for all the enabled AVR interrupts is four clock cycles minimum. After four clock cycles the Program Vector address for the actual interrupt handling routine is executed. [...] The vector is normally a jump to the interrupt routine, and this jump takes three clock cycles. [...] If an interrupt occurs when the MCU is in sleep mode, the interrupt execution response time is increased by four clock cycles."

While section 4.7.1 is reasonably detailed, it has one significant error, and another important omission.  The error is the sentence, "The vector is normally a jump to the interrupt routine, and this jump takes three clock cycles".  All AVRs with less than 8KB of flash, like the ATtiny13, have no jump instruction.  They only have a relative jump "rjmp", which takes two clock cycles.  This is obviously a copy/paste error from the datasheet of an AVR with more than 8KB of flash.  Anyone familiar with the AVR instruction set would likely catch this simple error.  The omission from section 4.7.1 is much harder to recognize until you carefully examine section 9.2 and figure 9-1 in the datasheet.

Figure 9-1 shows a circuit which appears to add a latency of two clock cycles to pin change interrupts.  There is no written description for the circuit, and the external interrupt details in section 9.2 of the datasheet state, "Pin change interrupts on PCINT[5:0] are detected asynchronously."  Since pin change interrupts can be used to wake the part from power-down sleep mode when all clocks are disabled, they must be detected asynchronously during power-down sleep.  To determine when they are detected synchronously requires testing.

To test the interrupt latency I wrote a program in assembler that can generate low pulses of different lengths using PWM.  I chose not to write the program in C because I want to be able to measure the interrupt latency down to a single cycle.  On the t13, PB1 is the pin for INT0, PCINT1, and OC0B.  By using OC0B to generate a low pulse on PB1, I'll be able to trigger INT0 and PCINT1 without any external connections.  When the interrupt is triggered, it should take four cycles to execute the code at the interrupt vector.  That code is an rjmp to the interrupt function, and that rjmp takes two additional clock cycles.  For the best-case latency, the first instruction in the interrupt function will execute six cycles after the interrupt is triggered.

The first instruction of the interrupt function checks the state of the pin that triggered the interrupt (the "sbic" instruction).  If the pin is low, it skips the next instruction, then goes into an infinite loop.  If the pin is high, it toggles the LED pin.  Since the PWM is configured to generate a low pulse, if the pulse has ended before the sbic, the LED will light up to indicate the interrupt response time was too slow.  The length of the pulse is one cycle longer than the value stored in OCR0B, which is done at lines 28 and 29.  My testing consisted mainly of modifying the OCR0B value, then building and flashing the modified code to the AVR.
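
Here's a rough C equivalent of the test setup.  The actual test program is in assembler, since a C ISR adds prologue cycles that would blur single-cycle measurements, and the LED pin (PB4 here) is my assumption:

#include <avr/io.h>
#include <avr/interrupt.h>

int main()
{
    DDRB = _BV(PB1) | _BV(PB4);     // OC0B/INT0/PCINT1 output, LED on PB4
    // fast PWM, inverting OC0B: low from BOTTOM until compare match,
    // so the low pulse lasts OCR0B + 1 cycles
    TCCR0A = _BV(COM0B1) | _BV(COM0B0) | _BV(WGM01) | _BV(WGM00);
    OCR0B = 8;                      // pulse length to test, minus one cycle
    GIMSK = _BV(INT0);              // or _BV(PCIE) with PCMSK = _BV(PCINT1)
    MCUCR = _BV(ISC01);             // INT0 on falling edge
    sei();
    TCCR0B = _BV(CS00);             // start timer, no prescaling
    for (;;);
}

ISR(INT0_vect)
{
    if (PINB & _BV(PB1))            // pulse already over: response too slow
        PORTB |= _BV(PB4);          // light the LED
    for (;;);
}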

Results

As expected INT0 latency is 4 clock cycles from the end of the currently executing instruction.  This means that if the interrupt occurs during the first cycle of a call instruction which takes 3 cycles, the interrupt response time will be 6 cycles.  For pin change interrupts, the latency is 6 cycles, indicating the synchronizer circuit adds 2 cycles of latency.  In idle sleep mode, both INT0 and PCINT latency is 8 cycles, indicating pin change interrupts operate asynchronously when the CPU clock is not running.

A full-duplex tiny AVR software UART

I've written a few software UARTs for AVR MCUs.  All of them have bit-banged the output, using cycle-counted assembler busy loops to time the output of each bit.  The code requires interrupts to be disabled to ensure accurate timing between bits.  This makes it impossible to receive data at the same time as it is being transmitted, and therefore the bit-banged implementations have been half-duplex.  By using the waveform generator of the timer/counter in many AVR MCUs, I've found a way to implement a full-duplex UART, which can simultaneously send and receive at up to 115kbps when the MCU is clocked at 8MHz.

I expect most AVR developers are familiar with using PWM, where the output pin is toggled at a given duty cycle, independent of the code execution.  The technique behind my full-duplex UART is using the waveform generation mode so the timer/counter hardware sets the OC0A pin at the appropriate time for each bit to be transmitted.  TIM0_COMPA interrupt runs after each bit is output.  The ISR determines if the next bit is a 0 or a 1.  For a 1 bit, TCCR0A is configured to set OC0A on compare match.  For a 0 bit, TCCR0A is configured to clear OC0A on compare match.  The ISR also updates OCR0A with the appropriate timer count for the next bit.  To allow for simultaneous receiving, the TIM0_COMPA transmit ISR is made interruptible (the first instruction is "sei").
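
Here's a hedged C sketch of that transmit ISR; the variable names (tx_shift, tx_bit_count) are illustrative, not the library's actual identifiers, and F_CPU is assumed to come from the build:

#include <avr/io.h>
#include <avr/interrupt.h>

#define BAUD 57600UL
#define BIT_TICKS (F_CPU / BAUD)        // timer ticks per bit

volatile uint16_t tx_shift;             // start, data, and stop bits, LSB first
volatile uint8_t tx_bit_count;

ISR(TIM0_COMPA_vect, ISR_NOBLOCK)       // interruptible, so receive ISRs can run
{
    if (tx_bit_count) {
        if (tx_shift & 1)               // next bit is a 1:
            TCCR0A |= _BV(COM0A1) | _BV(COM0A0);             // set OC0A on match
        else                            // next bit is a 0:
            TCCR0A = (TCCR0A | _BV(COM0A1)) & ~_BV(COM0A0);  // clear OC0A on match
        tx_shift >>= 1;
        tx_bit_count--;
        OCR0A += BIT_TICKS;             // schedule the next bit
    }
}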

The receiving is handled by PCINT0, which triggers on the received start bit, and TIM0_COMPB interrupt which runs for each received bit.  I wrote this ISR in assembler in order to ensure the received bit is read at the correct time, taking into consideration interrupt latency.  If any other interrupts are enabled, they must be interruptible (ISR_NOBLOCK if written in C).  I've implemented a two-level receive FIFO, which can be queried with the rx_data_ready() function.  A byte can be read from the FIFO with rx_read().

The code is written to work with the ATtiny13, ATtiny85, and ATtiny84.  Only PCINT0 is supported, which on the t84 means that the receive pin must be on PORTA.  With a few modifications to the code, PCINT1 could be used for receiving on PORTB with the t84.  The total time required for both the transmit and the receive ISRs is 52 cycles.  Adding an average interrupt overhead of 7 cycles for each ISR means that there must be at least 66 cycles between bits.  At 8MHz this means the maximum baud rate is 8,000,000/66 = 121kbps.  The lowest standard baud rate that can be used with an 8MHz clock is 9600bps.

The wgmuart application implements an example echo program running at the default baud rate of 57.6kbps.  In addition to echoing back each character received, it prints out a period '.' every second along with toggling an LED.


I've published the code on github.

Getting started with the WCH CH551 and CH552

When I first read about the CH554 series of MCUs, I thought it would be interesting to test out some day.  Part of the attraction is that it's based on the 8051, which is a well-documented and widely used architecture.  The first assembly language I learned almost 40 years ago was for the 6502, so learning to program the 8-bit CISC should be relatively easy.

Instead of purchasing the bare chips for pennies at LCSC and putting together a breakout board, I bought a couple modules from Electrodragon.  I had learned that the CH551, CH552, and CH554 all used the same die.  I bought the CH551 and CH552 modules with the intention of eventually trying to hack them into working as a CH554.

For testing the modules, in addition to the CH554 SDK for SDCC on Linux, I've used Ch55xduino on Windows.  One thing not in the Ch55xduino documentation is driver setup.  The windoze version I'm using is 7E, and when I first inserted the CH551 module, I got a driver error.

Using Zadig to set the driver to libusb-win32 solved the problem.

The CH55xduino documentation also lacks pinout documentation for anything other than the reference board.  To help, I've copied the pinouts from the CH552 datasheet.


The CH55x bootloader supports DFU, which is what the CH55xduino uploader uses the first time code is uploaded to the module.  Once the first sketch is uploaded, the CH55xduino core includes a CDC serial stack.  With my CH551 module no longer appearing as a DFU device, I had to use Zadig again to change the CDC Serial device to use the USB Serial (CDC) driver.  After that, the module appears as a COM port.

With the COM port selected in the Arduino IDE, subsequent uploads enter the bootloader by switching the baud rate to 1200bps.  If no COM port is selected, the upload tool looks for a CH55x device in DFU bootloader mode.  To enter the bootloader, it is necessary to pull the USB D+ pin up to 3.3V when power is applied.  The Electrodragon boards have a pinout for an upload jumper, which when shorted will connect the D+ pin (P3.6/UDP) to 3.3V through a 10k resistor.  On one of my modules I soldered pin headers and use a jumper to force it into upload mode.  On the other, I just used a low-value (270 ohm) through-hole resistor pushed into the holes.

Currently CH55xduino is not optimized for size, with a basic blink sketch requiring 5333 bytes of flash.  Officially, the CH551 is only supposed to have 10kB of available flash, so the CH55xduino overhead means less than 5kB is left for user code.  The CH551 actually seems to have 12kB available for flashing user code, which I think will be plenty if the CH55xduino core gets some optimization work.  Since I like to do low-level embedded coding, I'll be using SDCC from the command line most of the time.  The blink example in the CH554 SDK for SDCC compiles to 700 bytes, and I was able to get that down to 232 bytes after leaving out the UART initialization in debug.c. With a bit more optimization I think I can get the blink example down to 100 bytes or so.

One small surprise I found during my testing is that the Electrodragon CH551 and CH552 modules use different pins for the user LED.  On the CH551, use P3.0, which works in open-drain mode so the LED lights up when P3.0 is low.  On the CH552, drive P1.4 high to light the LED.  This is documented on the Electrodragon web site, but it is easy to forget when switching between the two modules.

I've already started to learn how to configure the standard MCS-51 UART, and have figured out how to directly manipulate the ports using the SFRs (Special Function Registers).   Once I've mastered how to program these cheap little devices, I'll follow up with another blog post revealing the details.
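
As a starting point, here's a hypothetical SDCC blink program for the Electrodragon CH551 module; register names are from ch554.h, and CfgFsys() and mDelaymS() are SDK helpers from debug.h:

#include <ch554.h>
#include <debug.h>

void main()
{
    CfgFsys();                  // configure the system clock
    P3_MOD_OC |= (1 << 0);      // P3.0 open-drain...
    P3_DIR_PU |= (1 << 0);      // ...with internal pullup
    while (1) {
        P3 ^= (1 << 0);         // toggle P3.0; the LED lights when it's low
        mDelaymS(500);
    }
}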

Hacker's Intro to USB hardware

Low-speed 1.5Mbps and full-speed 12Mbps USB, while more complicated than a UART, are still hacker-friendly.  As the standard approaches 25 years old, I've decided to document some of the more useful highlights I've learned.

While some USB devices will have accessible PCB pads where you can probe signals, it's best to have some breakouts and pass-thru cables with test points.  I've found broken micro-USB cables to be a cheap option.  I cut the micro-B end off, strip the wires, and solder them to some protoboard with 4 pin headers for the ground, 5V, D+, and D- connections.  A crude USB voltage tester can be made with a couple of silicon diodes and a white or blue LED in series, powered by the 5V line.  In the 20mA range, a 1N4148 has a vF of about 0.8V, so a 3.4V LED will be brightly lit if 5V is present.  I've also made a custom USB-A extension cable with a section of the D+ and D- wires exposed for easy attachment of alligator clips.

Although USB power is 5V, typically at up to 500mA, the signalling is 3.3V.  At the host, the data pins are pulled to ground with a resistance between 15k and 22k.  At the device, the D+ (full-speed) or D- (low-speed) pin is pulled up to 3.3V to signal to the host that a device is attached.  The spec shows this being done with a 1.5k pullup to 3.6V, which creates an 18.5k/20k divider, resulting in 3.6V * 0.925 or 3.33V.  I've found a 10k pullup to 5V works just fine, and many devices use a 1.5k pullup to 3.3V, since the spec requires a minimum of 2.7V for detection to work.  For a connected low-speed device (like a mouse), D+ will be near 0V, and D- will be near 3.3V.  For a full-speed device, the polarity will be reversed.  High-speed devices use low-swing 400mV signalling with both D+ and D- at 0V when idle.

The frequency counter on a multimeter can be used to tell if a device is alive, or if the host has failed to recognize it.  For a device that has been enumerated by a host, the host will send a keepalive signal to the device.  For a low-speed device, this is a single-ended 0 (SE0) where D- is pulled low for 1.3us every ms.  Therefore, a frequency of at least 1kHz will be detected on the D- line.

You can get a USB device to reconnect without unplugging it by forcing a bus reset.  This can be done by shorting the D+ (full-speed) or D- (low-speed) line.  To avoid releasing the magic smoke by accidentally shorting the wrong connection, I suggest using a 100-150 ohm resistor, which is still more than sufficient to reset the bus.

Flashing AVRs at high speed

I've written a few bootloaders for AVR MCUs, which necessarily need to modify the flash while running.  The typical 4ms to write or erase a page depends on the speed of the internal RC oscillator.  Here's a quote from section 6.6.1 of the ATtiny88 datasheet:

Note that this oscillator is used to time EEPROM and Flash write accesses, and the write times will be affected accordingly. If the EEPROM or Flash are written, do not calibrate to more than 8.8 MHz. Otherwise, the EEPROM or Flash write may fail.

I wondered how running the RC oscillator well above 8.8MHz would impact erasing and writing flash.  In the past I read about tests showing the endurance of AVR flash and EEPROM is many times more than the spec, but I couldn't find any tests done while running the AVR at high speed.  I did come across a post from an old grouch on AVRfreaks warning not to do it, so now I had to try.

The result is a program I called flashabuse, which you'll see later is a bit of a misnomer.  What the program does is set OSCCAL to 255, then repeatedly erase, verify, write, and verify a page of flash.  I chose to test just one page of flash for a couple reasons.  First, testing all 128 pages of flash on an ATtiny88 would take much more time.  The second is that I would only risk damaging one page, and an ATtiny88 with 127 good pages of flash is still useful.
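
Here's a minimal sketch of the test loop (not the actual flashabuse code); it assumes a part like the ATtiny88 that allows SPM from application flash, and TEST_PAGE is an arbitrary page address:

#include <avr/io.h>
#include <avr/boot.h>
#include <avr/pgmspace.h>

#define TEST_PAGE 0x0E00

static uint8_t page_is(uint8_t expected)
{
    for (uint16_t i = 0; i < SPM_PAGESIZE; i++)
        if (pgm_read_byte(TEST_PAGE + i) != expected) return 0;
    return 1;
}

int main()
{
    OSCCAL = 255;                       // run the RC oscillator far above spec
    for (;;) {
        boot_page_erase(TEST_PAGE);
        boot_spm_busy_wait();
        if (!page_is(0xFF)) break;      // erased AVR flash reads 0xFF
        for (uint16_t i = 0; i < SPM_PAGESIZE; i += 2)
            boot_page_fill(TEST_PAGE + i, 0x5555);
        boot_page_write(TEST_PAGE);
        boot_spm_busy_wait();
        if (!page_is(0x55)) break;      // verify the write
    }
    for (;;);                           // hang on a verify failure
}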

The results were very positive.  My little program was completing about 192 cycles per second, taking 2.6ms for each page erase or page write.  I let it run for an hour and a half, so it successfully completed 1 million cycles.  Not bad considering Atmel's design specification is a minimum of 10,000 cycles.

So why does the flash work fine at high speed?  I think it has to do with how floating-gate flash memory works.  Erasing and writing the flash requires removing and adding a charge to the floating gate using high voltages.  Atmel likely uses timing margins well in excess of the 10% indicated in the datasheet, so even half the typical 4ms is more than enough to ensure error-free operation.  I even think writing at high speed puts less wear on the flash because it exposes the gate to high voltages for a shorter period of time.

Addendum

I received some feedback questioning whether the faster write time may reduce retention due to reduced charge on the floating gate.  As I mentioned above, Atmel likely used a very large timing margin when designing the flash memory.  Chris Lamont, who tested flash retention on a PIC32, stated that retention failure is "extremely unlikely".

The retention specs for the ATtiny88 are, "20 years at 85°C / 100 years at 25°C".  As this Micron technical note (PDF) shows, retention specs are based on models, not actual testing.  Micron's JESD47I PCHTDR testing is done at 125C for 1000 hours, and requires 0 failures.  TEKMOS states, "As a very rough rule of thumb, the data retention time halves for every 10C rise in temperature."  Extrapolating from a 100-year retention at 25C, retention at 255C, a typical reflow soldering peak temperature, would be only 6 minutes.

In an attempt to show that retention is not impacted by repeated fast flashing, I performed two additional tests.  For the first test, I baked the subject MCU for 12 hours at 150C, then performed 100,000 fast write/erase cycles.  Next, 0x55 was written to the test page, and repeatedly verified for 2 hours.  This test passed with no errors.  For the second test, I filled the 8kB of flash with zeros to put a charge on the floating gate for every bit.  I then baked the subject MCU for 12 hours at 150C, and verified that all bits remained at zero.  This test passed with all 65,536 bits reading zero.  I did, however, have a failure of one solder joint, likely due to the stress of thermal cycling.

For those who are particularly concerned about flash retention, one solution is refreshing the flash.  For an AVR MCU, it would be simple to refresh the flash on every bootup with a small segment of code in .init1.  The code would copy each page into the page buffer, then perform a write on the page.  This would refresh all the 0 bits, and extend the retention life for another 20 to 100 years.
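
Here's a hypothetical (and untested) sketch of that idea; rewriting each page with its own contents re-programs the 0 bits without an erase cycle:

#include <avr/io.h>
#include <avr/boot.h>
#include <avr/pgmspace.h>

// placed in .init1 so it runs at every bootup, before main()
__attribute__((used, naked, section(".init1")))
static void refresh_flash(void)
{
    for (uint16_t page = 0; page < FLASHEND; page += SPM_PAGESIZE) {
        for (uint16_t i = 0; i < SPM_PAGESIZE; i += 2)
            boot_page_fill(page + i, pgm_read_word(page + i));
        boot_page_write(page);          // write without erase: refreshes 0 bits
        boot_spm_busy_wait();
    }
}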

Recording the Reset Pin

The AVR reset pin has many functions.  In addition to being used as an external reset signal, it can be used for debugWIRE, and it is used for SPI and for high-voltage programming.  Other than when it is used as an external reset signal, the datasheet specifications are somewhat ambiguous.  I recently started working on an updated firmware for the USBasp, and wanted to find out more details about the SPI programming mode.  The image above is one of many recordings I made from programming tests of AVR MCUs.

When I first started capturing the programming signals, I observed seemingly random patterns on the MISO line before programming was enabled.  Although the datasheet lists the target MISO line as being an output, it only switches to output mode after the first two bytes of the "Programming Enable" instruction, 0xAC 0x53, are received and recognized.  Prior to that the pin floats, and the seemingly random patterns I observed were caused by the signals on the MOSI and SCK lines inducing a voltage on the MISO line.  I enabled the pullup resistor on the programmer side in order to keep the MISO line high until the PE instruction was recognized by the target.

One of the steps in the datasheet's serial programming algorithm that doesn't make sense to me is step 2, which says, "Wait for at least 20 ms and enable Serial Programming by sending the Programming Enable serial instruction to pin MOSI."  It's clear from the capture image above that a wait time of less than 100 us worked in this case.  I did a number of experiments with different targets (t13, t85, m8a) with and without the CKDIV8 fuse set, and found a delay of 64 us was always sufficient.  Nevertheless, I still used a 20 ms delay in the USBasp firmware.

Another observation I made was of a repeatable delay between the 8th rising edge of the SCK signal on the second byte and MISO going low.  After multiple tests, I found that delay is between 2 and 3 of the target clock cycles.  A close-up of the 0x53 byte shows this clearly:


The 2-3 clock cycle delay seems to correspond with the datasheet's specification of the minimum low and high periods for the SCK signal of 2 clock cycles when the target is running at less than 12MHz.  However, I found I couldn't consistently get a target running at 8MHz to enter programming mode with a SCK clock of 1.5MHz.  Additional logs of the programming sequence revealed something interesting when multiple PE instructions are sent at less than 1/8th of the target clock rate, with a positive pulse on RST for synchronization.  In those sequences, the delay between the 8th rising edge of the SCK signal on the second byte and MISO going low was smaller for the second and subsequent times the PE instruction was sent.  It seems you need to use a slower SCK frequency to get the target into programming mode, but after that, the frequency can be increased to 1/4 of the target clock.

Using what I learned, I have implemented automatic SCK speed negotiation and a higher default SCK clock speed.  The speed negotiation starts with 1.5MHz for SCK, and makes 3 attempts to enter programming mode.  If that fails, the next slower speed (750kHz) is tried three times, and so on until a speed is found where the target responds.  For subsequent communications with the target, the speed is doubled, since the slowest speed is only needed the first time the PE command is received after power-up.  The firmware also supports a maximum SCK frequency of 3MHz, vs 1.5MHz for the original firmware.
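
This is roughly the negotiation logic, sketched in C; spi_set_speed() and spi_enter_pgm_mode() are hypothetical helpers, not the actual USBasp firmware functions:

#include <stdint.h>

extern void spi_set_speed(uint32_t hz);
extern uint8_t spi_enter_pgm_mode(void);    // sends 0xAC 0x53 0x00 0x00, checks echo

uint8_t negotiate_sck(void)
{
    static const uint32_t speeds[] = { 1500000, 750000, 375000, 187500 };
    for (uint8_t s = 0; s < sizeof(speeds) / sizeof(speeds[0]); s++) {
        spi_set_speed(speeds[s]);
        for (uint8_t attempt = 0; attempt < 3; attempt++) {
            if (spi_enter_pgm_mode()) {
                spi_set_speed(speeds[s] * 2);   // slow SCK only needed for the first PE
                return 1;
            }
        }
    }
    return 0;                                   // no response at any speed
}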

The higher speeds don't make a large difference in flash/verify times since the overhead of the V-USB code tends to dominate beyond a SCK frequency of 750kHz or so.  Reading the 8kB of flash on an ATtiny85 takes around 3 seconds.  By optimizing the low-speed USB code, such as was done by Tim with u-wire, it should be possible to double that speed.

LGT8F328P EDMINI board


Earlier this year I purchased an EDMINI board from Electrodragon.  It uses an LGT8F328P chip, which supports the AVR instruction set.  The instruction set timings and peripheral registers vary slightly from the ATmega328P, so it is not 99% compatible as claimed by Electrodragon.  I bought one to see just how compatible it is, and possibly to port some of my AVR libraries to the LGT MCU.

The module arrived in an anti-static bag, inside a padded envelope.  After connecting 5V power to the board, the D13 LED blinked on and off every second, suggesting that it comes with the Arduino blink sketch pre-loaded.  I then hooked up a USB-TTL adapter, installed the LGT board file in the Arduino IDE, and tried flashing a modified blink sketch to the board.  The upload failed, and after some debugging I found that the reset was not working on the MCU.  Neither pressing and holding the reset button nor grounding RST would reset the board.  After contacting Electrodragon, Chao agreed to replace the board with two new boards.  He told me that they see a higher than average failure rate with the LGT8F328P chips.

In addition to Chao's frank comment about reliability, another concern I had about the LGT parts was the lack of markings on the chip.  I suspect LGT sells the parts without markings so vendors can label them with their own brand.  This also makes it easier for more nefarious manufacturers to label them as an ATmega328p.  

When the new boards arrived, the first thing I did was make sure the reset button worked.  After pressing reset, the LED flashes quickly three times for the bootloader, and then flashes on and off every second.  However, when I tried uploading a sketch using the Arduino IDE, the upload still failed.  After some more debugging, I found I could upload if I pressed the reset button just before uploading.  This meant the bootloader was working, but auto-reset (toggling the DTR line) was not.  These boards use the same auto-reset circuit as an Arduino Pro Mini:

A negative pulse on DTR will cause a voltage drop on RST, which is supposed to reset the target.  When the target power is 5V and 3V3 TTL signals are used, toggling DTR will cause RST to drop from 5V to about 1.7V (5 - 3.3).  With the ATmega328P and most other AVR MCUs, 2V is low enough to reset the chip.  The LGT8F328P, however, requires a lower voltage to reset.  In some situations this can be a good thing, as it means the LGT MCU is less likely to reset due to electromagnetic interference.

The EDMINI board has a 3V3 regulator which can be selected by a solder jumper.  This is mentioned on the Electrodragon site, but it is not clearly documented which pads need to be shorted to switch from 5V to 3V3.  After a bit of debugging I was able to run the board at 3V3, and was able to use the auto-reset feature.

I do most of my AVR development using command line tools.  I compiled a small program that toggles every pin on PORTB, and flashed it to the EDMINI board using avrdude.  Nothing happened.  Since the Arduino blink sketch worked, I know that the LED on PB5 was working.  My conclusion is that the LGT Arduino core must do some setup to enable PORTB.  This is common on modern MCUs such as the ARM Cortex, but on AVRs like the ATmega328p, writing 255 to the PORTB and DDRB registers is all it takes to drive every pin on port B high.
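
For reference, this is the kind of test program I mean (a sketch, not the exact code); on a genuine ATmega328P it toggles every pin on port B with no other setup:

#define F_CPU 16000000UL    // assumed clock
#include <avr/io.h>
#include <util/delay.h>

int main()
{
    DDRB = 255;             // all port B pins as outputs
    for (;;) {
        PINB = 255;         // writing 1s to PINB toggles PORTB on real AVRs
        _delay_ms(500);
    }
}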

I won't be doing any development work with the LGT MCUs.  Although they are cheaper and can run a bit faster than authentic AVR parts, their compatibility is rather limited.  Any code that relies on the standard AVR instruction set timing, such as my picoUART library, will not work.  The 8F328P cannot be programmed with a USBasp, as the native programming interface is SWD, not Atmel's SPI-based protocol.  For a cheap and powerful MCU, the CH551 looks much more interesting.


STM32 Starting Small

For software development, I often prefer to work close to the hardware.  Libraries that abstract away the hardware not only use up limited flash memory, they add to the potential sources of bugs in your code.  For a basic test of STM32 library bloat, I compiled the buttons example from my TM1638NR library in the Arduino 1.8.13 IDE using stm32duino for a STM32F030 target.  The flash required was just over 8kB, or slightly more than half of the 16kB of flash specification on the STM32F030F4P6 MCU.  While I wasn't ready to write my own tiny Arduino core for the STM32F, I was determined to find a more efficient way of programming small ARM Cortex-M devices.

After a bit of searching, looking at Bill Westfield's Minimalist ARM project, libopencm3, and other projects, I found most of what I was looking for in a series of STM32 bare metal programming posts by William Ransohoff.  However, instead of using an ST-Link programmer, I decided to use a standard USB-TTL serial dongle to communicate with the ROM bootloader on the STM32.

To enable the bootloader, the STM32 boot0 pin must be pulled high during power-up.  The bootloader will then wait for communication over the USART Tx and Rx lines.  On the STM32F030F4P6, the Tx line is PA9, and the Rx line is PA10.  In order to reset the chip before flashing, I also connected the DTR line from my serial module to NRST (pin 4) on the MCU as shown in the following wiring diagram:

For flashing the MCU, I decided on stm32flash.  While installation on Debian Linux is as simple as "apt install stm32flash", I had some difficulty finding a recent Windows build, so I ended up building it myself.  Although my build defaults to 115.2kbps, I found 230.4kbps completely reliable.  At 460.8kbps and 500kbps, I encountered intermittent errors, so I stuck with 230.4kbps.  After making the necessary connections, and before flashing any code to the MCU, do a test to confirm the MCU is detected.

One thing to note about stm32flash is that it does not detect the amount of flash and RAM on the target MCU.  The numbers come from a hard-coded table based on the device ID reported.  The official flash size in kB is stored in the system ROM at address 0x1FFFF7CC.  On my STM32F030F4P6, the value read from that address is 0x0010, reflecting the spec of 16kB flash for the chip.  My testing revealed that it actually has 32kB of usable flash.

I used William's STM32F0 GPIO example as a template to create a tiny blinky example that uses less than 300 bytes of flash.  Most of that is for the vector table, which on the Cortex-M0 has 48 entries of 4 bytes each.  To save space, I embedded the reset handler in an unused part of the vector table.  Since the blinky example doesn't use any interrupts, all but the initial stack pointer at vector 0 and the reset handler at vector 1 could technically be omitted.  I plan to re-use the vector table code for other projects, so I did not prune it down to the minimum.
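
A minimal sketch of the idea, assuming a GNU linker script that places a ".vectors" section at the start of flash, with a hypothetical _estack symbol:

#include <stdint.h>

extern uint32_t _estack;            // from the linker script
void reset_handler(void);

__attribute__((section(".vectors"), used))
static const void *vectors[] = {
    &_estack,                       // vector 0: initial stack pointer
    reset_handler,                  // vector 1: reset handler
    // remaining entries omitted or re-used, as described above
};

void reset_handler(void)
{
    for (;;);                       // real code: init .data/.bss, call main()
}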

The blinky example will toggle PA9 at a frequency of 1Hz.  That is the UART Tx pin on the MCU, which is connected to the Rx pin on the USB-TTL dongle.  This means when the example runs, the Rx LED on the USB-TTL dongle will flash on and off.

I think my next step in Cortex-M development will be to experiment with libopencm3.  It appears to have a reasonably lightweight abstraction of GPIO and some peripherals, so it should be easier to write code that is portable across multiple different ARM MCUs.


Trying to test a "ten cent" tiny ARM-M0 MCU

A few months ago, while browsing LCSC, I found a surprisingly cheap ARM M0 MCU.  At the time it was 16.6c in single-unit quantities, with no higher-volume pricing listed.  From the datasheet LCSC has posted, there was enough information in English to tell that it has 2kB RAM, 16kB flash, and runs up to 32MHz with a 1.8V to 3.6V power supply.  Although the part number suggests it may be a clone or is compatible with the STM32F030, it's not.  The part number for the STM32F030 clone is HK32F030F4P6.

Some additional searching brought me to some Chinese web sites that advertised the chip as a 32-bit replacement for the STM8S003.  The pinout matches the STM8S003F3P6, so in theory it is a drop-in replacement for the 8S003.  Unlike the STM32F0, it has no serial bootloader, so programming has to be done via SWD.  And with no bootloader support, there's no need to be able to remap the flash from 0x08000000 to 0x00000000 like the STM32.  A small change to the linker script should be all it takes to handle that difference.  Even though I wasn't sure how or if I'd be able to program the chips, I went ahead and ordered a few of them.  I already had some TSSOP20 breakout boards, so the challenge would be in the software, and the programming hardware.

Since I'm cheap, I didn't want to buy a dedicated DAPLink programmer.  I have a STM32F103 "blue pill", so I considered converting it to a Black Magic Probe.  But since I've been playing with the CH554 series of chips, I decided to try running CMSIS-DAP firmware on a CH552.  If you're not familiar with CMSIS-DAP and SWD, I recommend Chris Coleman's blog post.  Before I tried it with the HK32F030MF4P6, I needed to try it with a known good target.  Since I had recently been working with a STM32F030, that's what I chose to try first.

The two main alternatives for open-source CMSIS-DAP software for downloading, running, and debugging target firmware are OpenOCD and pyOCD.  pyOCD is much simpler to use than OpenOCD; after installing it with pip, 'pyocd list' found my CH552 CMSIS-DAP:

However that's as far as I could get with pyOCD.  There seems to be a bug in the CMSIS-DAP firmware or pyOCD around the handling of the DAP_INFO message.  Fixing the bug may be a project for another day, but for the time being I decided to figure out how to use OpenOCD.

To use OpenOCD, you need to create a configuration file with information about your debug adapter and target.  It's all documented, however it's very complicated given that OpenOCD does a whole lot more than pyOCD.  It's also complicated by the fact that since the release of v0.10.0, there have been updates that have made material changes to the configuration file syntax.  I had a working configuration file on Windows that wouldn't work on Linux.  On Linux I was running OpenOCD v0.10.0-4, but on Windows I was running v0.10.0-15.  After installing the xPack project OpenOCD build on Linux, the same config file, which I named "cmsis-dap.cfg", worked on both Linux and Windows:

adapter driver cmsis-dap

transport select swd
adapter speed 100

swd newdap chip cpu -enable
dap create chip.dap -chain-position chip.cpu
target create chip.cpu cortex_m -dap chip.dap

init
dap info

With dupont jumpers connecting SWCLK, SWDIO, VDD, and VSS on my STM32F030 breakout board, here's the output from openocd.

After making the same connections (factoring the different pinout) to the HK32F030MF4P6, I was getting no response from the MCU.  Before connecting, I had done the usual checks for shorts and continuity, making sure all my solder connections were good.  Next I tried just connecting VDD and VSS, while I probed each pin.  Pin 2, SWDIO, was pulled high to 3V3, as was nRST.  All other pins were low, close to 0V.  The STM32F030 pulls SWDIO and nRST high too.  I tried reconnecting SWDIO and SWCLK, and connecting a line to control nRST.  I added "reset_config trst_and_srst" to my config file, and still didn't get a response.  Looking at the debug output from openocd (-d flag) shows the target isn't responding to SWD commands:

Debug: 179 99 cmsis_dap_usb.c:728 cmsis_dap_swd_read_process(): SWD ack not OK @ 0 JUNK
Debug: 180 99 command.c:626 run_command(): Command 'dap init' failed with error code -4


Since the datasheet says that after reset, pin 2 functions as SWDIO, and pin 11 functions as SWCLK, I'm at a bit of an impasse.  I'll try hooking up my oscilloscope to the SWDIO and SWCLK lines to make sure the signals are clean.  I've read that in some ARM MCUs, DAP works while the device is in reset, so I'll peruse the openocd docs to figure out how to hold nRST low while communicating with the target.  And of course, suggestions are welcome.


Before I finish this post, I wanted to explain the reference to a "ten cent" MCU.  LCSC does not list volume pricing for the part, but when I searched for the manufacturer's name, "Shenzhen Hangshun Chip Technology Development", I found an article about the company.  In the article, the company president, Liu Jiping, refers to the 10c ($0.1) price.  I suspect that pricing is for quantities over 1000.  Assuming these chips can actually be programmed with a basic SWD adapter, then even paying 20c for a 20-pin, 32MHz M0 MCU looks like a good deal to me.


Trying to test a "ten cent" tiny ARM-M0 MCU part 2


After my first look at the HK32F030MF4P6, I wondered if the HK part, unlike the STM32F030 it is modeled after, does not have 5V-tolerant IO.  I changed the solder jumpers to 3V3 on the CH552 module I'm using as a CMSIS-DAP adapter, which caused it to stop working.  This was because the CH552 requires a 5V supply in order to run reliably at 24MHz.  After re-flashing the CMSIS-DAP firmware set to run at 16MHz, the module worked, and I was finally able to talk to the HK MCU via SWD.

In the screen shot above, I chose the stm32f051 target because pyocd has neither the HK MCU nor the STM32F030 among its built-in targets.  For basic SWD communications, the target option is not even necessary.  With the target specified, it's possible to specify peripheral registers by name, rather than having to specify a memory address to read or write.

In the screen shot above, I'm using the "connect_mode" option to bring the nRST line low on the target device when entering debug mode.  Usually this is not necessary for SWD, however some of the probing I did would cause the MCU to crash.  This required a power cycle or reset to restore communications via SWD.

The first tests I did with the HK MCU were to probe the flash and RAM.  The HK datasheet shows the flash at address 0.  In the STM32F0, the flash is at address 0x8000000, and is mapped to address 0 when the boot0 pin is low.  Although the HK MCU doesn't have a boot0 pin, data at address 0x8000000 is mirrored at address 0 as well.  What was most unusual about the HK MCU is that the flash was not erased to all 0xFF as is typical with other flash-based MCUs.  Most of the flash contents were zeros, except for some data at address 0x400, which was the same on the 2 MCUs I checked:

By writing to memory starting at 0x20000000 using the 'ww' command, I discovered that the MCUs I received have 4kB of RAM, rather than the 2kB specified in the datasheet.  Writing to 0x20001000 (beyond 4kB) results in a crash.

For writing and erasing the flash, I initially tried using the pyOCD 'erase' and 'flash' commands.  Since the MCU flash interface is not part of the Cortex-M specification, the flash interface peripheral will vary from one MCU vendor to the next.  The flash interface on the STM32F051 is almost identical to the flash interface on the STM32F030; however, the 'erase' and 'flash' commands caused the HK MCU to crash when I ran them.  Testing on a genuine STM32F030 crashed as well, and after some debugging and reading through the pyOCD code, I realized the STM32F051 flash routines need 8kB of RAM.  Even after downloading and installing the STM32F0 device pack, I could not erase or flash the HK MCU.

Next I reviewed the STM32F030 programming manual, and tried to access the flash peripheral registers directly.  This was when I found a pyOCD bug with the wreg command.  I was able to unlock the flash by writing the magic sequence of 0x45670123 followed by 0xCDEF89AB to flash.keyr.  I tried erasing the first page at address 0, and although flash.sr and flash.cr updated as expected, the memory contents did not change.  What did work was erasing the page at address 0x8000000, which cleared the contents at address 0 as well.  I still find it strange that the erase operation sets all bits to 0 instead of 1.  The HK datasheet says a flash page is 128 bytes, and erasing a page resulted in 128 bytes set to all zero.

I was only partially successful in writing data to the flash.  Writing to 0x8000000 did not work, however writing 16 bits to address 0 using the 'wh' command was successful.  Trying to write 16 bits to address 2 updated the flash.ar and flash.sr as expected, but did not change the data.  Writing to any 4-byte aligned address in the erased page worked, but writing to addresses that were only 2-byte aligned left all 16 bits at zero.  I tried writing bytes with 'wb' and full words with 'ww', both of which crashed the MCU, likely from a hard fault interrupt.  I even made sure there isn't a bug with the 'wh' command by writing 16 bits at a time to RAM.

While searching the CHK website for more documentation, I found a page with IAR device packs.  Although pyOCD uses Keil device packs, I downloaded the HK32F0 pack, which is a self-extracting RAR file that saves the uncompressed files in AppData\Local\Temp\RarSFX0.

Since .pack files are just zip files with a different extension, I zipped the files back up as a .pack file.  However pyOCD couldn't read it: "0000731:CRITICAL:__main__:CMSIS-Pack './HK32F0.pack' is missing a .pdsc file".  Manually examining the files confirmed some of my earlier discoveries, such as flash at address 0x8000000, remapped to address zero.  I found a file named HK32F030M.svd, which contains XML definitions of the peripheral registers.  pyOCD's builtin devices appear to use svd files, so it may be possible to add HK32F0 support to pyOCD.

Copies of the IAR support pack, datasheet, and pyocd page erase sequence can be found in my github repository.


GD32E230: a better STM32F0?

On my last LCSC order, I bought a few GD32E230 chips, specifically the GD32E230K8T6.  I chose the LQFP parts since I have lots of QFP32 breakout boards that I've used for other QFP32 parts.  Gigadevice is much better than many other Chinese MCU manufacturers when it comes to providing English documents.  After my past endeavors trying to understand datasheets from WCH and CHK, going through the Gigadevice documentation was rather pleasant.

Gigadevice makes no mention of any STM32 compatibility, but the first clue is the matching pinouts of the STM32F030 and GD32E230.  To prepare for testing, I tinned the pads on a couple of breakout boards, applied some flux, and laid the chips on the pads.  I laid the modules on a cast-iron skillet, and heated it up to about 240C.  The solder reflowed well, however I noticed some browning of the white silkscreen.  Next time I'll limit the temperature to 220C.  After testing for continuity and fixing a solder bridge, I was ready to try SWD.  I connected 3.3V power and the SWD lines, and ran "pyocd cmd -v":

0000710:INFO:board:Target type is cortex_m
0000734:INFO:dap:DP IDR = 0x0bf11477 (v1 MINDP rev0)
0000759:INFO:ap:AHB5-AP#0 IDR = 0x04770025 (AHB5-AP var2 rev0)
0000799:INFO:rom_table:AHB5-AP#0 Class 0x1 ROM table #0 @ 0xe00ff000 (designer=43b part=4cb)
0000812:INFO:rom_table:[0]<e000e000:SCS-M23 class=9 designer=43b part=d20 devtype=00 archid=2a04 devid=0:0:0>
0000823:INFO:rom_table:[1]<e0001000:DWT class=9 designer=43b part=d20 devtype=00 archid=1a02 devid=0:0:0>
0000841:INFO:rom_table:[2]<e0002000:BPU class=9 designer=43b part=d20 devtype=00 archid=1a03 devid=0:0:0>
0000848:INFO:cortex_m_v8m:CPU core #0 is Cortex-M23 r1p0
0000859:INFO:dwt:2 hardware watchpoints
0000866:INFO:fpb:4 hardware breakpoints, 0 literal comparators

I did a little probing around the chip memory.  The GD32E23x user manual shows SRAM at 0x20000000, like STM32 parts.  The contents looked like random values, which I could overwrite using the pyocd "ww" command.  Writing to 0x20002000 resulted in a memory fault, indicating the part does not have any "bonus" RAM beyond 8kB.

Next, I tried using the built-in serial bootloader.  After connecting BOOT0 to VDD and connecting power, PA9 and PA10 were pulled high, indicative of the UART being activated.  However my first attempt at using stm32flash was not successful:

After attaching my oscilloscope, and writing a small bootloader protocol test program, I was able to determine that the responses did seem to conform to the STM32 bootloader protocol.  I did notice that the baud rate from the GD32E230 was only 110kbps, so it wasn't perfectly matching the 115.2kbps speed of the 0x7F byte sent for baud rate detection.  To avoid the potential for data corruption, I switched to 57.6kbps.  Before resorting to debugging the source for stm32flash, my test of stm32loader gave better results:
$ stm32loader -V -p com39
Open port com39, baud 115200
Activating bootloader (select UART)
*** Command: Get
    Bootloader version: 0x10
    Available commands: 0x0, 0x2, 0x11, 0x21, 0x31, 0x43, 0x63, 0x73, 0x82, 0x92, 0x6
Bootloader version: 0x10
*** Command: Get ID
Chip id: 0x440 (STM32F030x8)
Supply -f [family] to see flash size and device UID, e.g: -f F1

Next, I was ready to try flashing a basic program.  I first checked for GD32E support in libopencm3.  No luck.  Then as I read through the user manual, I noticed GPIOA starts at 0x48000000 on AHB2, the same as STM32F0 devices.  The register names didn't match the STM32, but the function and offsets were the same.  For example on the GD32E, the register to clear individual GPIOA bits is called GPIOA_BC, rather than GPIOA_BRR as it is called on the STM32.  The clock control registers, called RCU on the GD32E, also matched the STM32 RCC registers.  Since it was looking STM32F0-compatible, I tried flashing my blink example with stm32loader, and it worked!
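
Here's a hedged sketch of that kind of blink test, with register addresses taken from the two manuals (the same offsets work on both parts; the bit positions are my reading of the manuals, not verified against the author's code):

#include <stdint.h>

#define RCC_AHBENR  (*(volatile uint32_t *)0x40021014)  // RCU_AHBEN on the GD32E
#define GPIOA_MODER (*(volatile uint32_t *)0x48000000)  // GPIOA_CTL on the GD32E
#define GPIOA_BSRR  (*(volatile uint32_t *)0x48000018)  // GPIOA_BOP on the GD32E
#define GPIOA_BRR   (*(volatile uint32_t *)0x48000028)  // GPIOA_BC on the GD32E

int main()
{
    RCC_AHBENR |= (1u << 17);       // enable the GPIOA clock
    GPIOA_MODER = (GPIOA_MODER & ~(3u << 18)) | (1u << 18);  // PA9 as output
    for (;;) {
        GPIOA_BSRR = (1u << 9);     // PA9 high
        for (volatile uint32_t i = 400000; i; i--);
        GPIOA_BRR = (1u << 9);      // PA9 low
        for (volatile uint32_t i = 400000; i; i--);
    }
}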

The LED was flashing faster than it did with the STM32F030.  A little searching revealed that the ARM Cortex-M23, like the M0+, has a 2-stage pipeline.  The STM32F030 with its M0 core has a 3-stage pipeline.  My delay busy loop needs to be four cycles per iteration, and on the M23, the bne instruction only takes two cycles.  My solution is adding a nop instruction based on an optional compile flag.

One problem I have yet to resolve with the GD32E is support for the bootloader Go/0x21 command.  With the STM32F0, I left BOOT0 high, and used DTR to toggle nRST before uploading new code.  The stm32flash "-g 0" option made the target run the uploaded code after flashing was complete.  I went back to debugging stm32flash, and discovered that it is hard-coded to use the "Get Version"/0x01 command, and silently fails if the bootloader responds with a NAK.  After a few mods to the source, I was able to build a version that works with the GD32E230, however the Go command still doesn't work.  Perhaps a task for a later date will be to hook up a debug probe to see what the E230 is doing when it gets the Go command.

Overall, I'm quite happy with the GD32E230K8T6.  They cost less than half the equivalent STM32 parts, and are even cheaper than other Chinese STM32 clones I've seen.  They are lower power and their maximum clock speed is 50% faster than the STM32F0.  In addition to the shorter 2-stage pipeline, the GD32E devices support single-cycle IO, making them faster for bit-banged communications than the STM32F0 which takes 2 cycles to write to a GPIO pin.  The GD32E230 also has some new features, which might be worth discussing in a future blog post.

Quirks of the CH55x MCUs


Over the past several months, I've been learning to use the CH551 and CH552 MCUs.  Learning generic 8051 programming was the easy part, as there is lots of old documentation available, with Philips having written some of the best.  The learning curve for WCH's additions to the MCS-51 architecture has been steeper, requiring careful reading of the datasheets, and reading the SDK headers and examples.  I've found that the CH55x chips have some quirks that I've never encountered on any other MCUs.


The GPIO modes are controlled by two registers: MOD_OC and DIR_PU.  The register values are explained in the datasheet and in ch554.h in the SDK.  Figure 10.2.1 in the datasheet shows a schematic diagram for the GPIO.  Modes 0, 1, and 2 are for high-Z input, push-pull, and open-drain respectively.  Mode 3, "standard 8051 mode", is the most complicated.  It's an open-drain mode with internal pullup, but with the output driven high for two cycles when the GPIO changes from a 0 to a 1.  This ensures a fast signal rise time.  The part that took me the longest to figure out was the operation of the pullup.  The GPIO diagram shows 70k and 10k resistors, but section 10 of the datasheet does not explain their operation.  Therefore I've highlighted a part of the schematic in green.  When the pin's input schmitt trigger output is 1, the inverter in the top right of the diagram will output a low signal to turn on the pFET activating the 10k pullup.  When the port input value is 0, only the weak 70k pullup is active.
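
In code, the four modes map onto the two register bits like this (a sketch for P1.4, using the register names from ch554.h):

#include <ch554.h>
#include <stdint.h>

void p1_4_set_mode(uint8_t mode)    // hypothetical helper
{
    switch (mode) {
    case 0: P1_MOD_OC &= ~(1 << 4); P1_DIR_PU &= ~(1 << 4); break;  // high-Z input
    case 1: P1_MOD_OC &= ~(1 << 4); P1_DIR_PU |= (1 << 4); break;   // push-pull
    case 2: P1_MOD_OC |= (1 << 4); P1_DIR_PU &= ~(1 << 4); break;   // open-drain
    case 3: P1_MOD_OC |= (1 << 4); P1_DIR_PU |= (1 << 4); break;    // standard 8051
    }
}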

The pullups aren't actually implemented as resistors on the IC.  They are specially-designed FETs with a high drain-source resistance (RDS).  Since RDS varies with gate-source voltage (Vgs), the pullup resistance will vary inversely with Vcc.  Using a 5V supply, the pullup resistance will be close to the 70k shown in the schematic.  Using a 3.3V supply, the pullup resistance is close to 125k.  Although it is not obvious, this information can be found in section 18 of the datasheet, with the specifications for IUP5 and IUP3.  These numbers are the amount of current a grounded pin will source when the pullup is enabled.

The reset pin has an internal pulldown, which seems to be weak like the GPIO pullups.  At times when working with a CH552 running at 3V3, the chip reset when I inadvertently touched the RST pin with my finger.  This was easily solved by keeping the RST pin shorted to ground.

The last issue I encountered is more of a documentation issue than a quirk.  The maximum reliable clock speed of an IC depends on the supply voltage.  All of the AVR MCUs I've worked with have a graph in the datasheet showing the voltage required to ensure safe operation at a given speed.  For the CH55x MCUs, there is a subtle difference in the electrical specs in section 18 of the datasheet.  At 5V, total supply current at 24MHz is specified, whereas the specs for 3.3V specify total operating current at 16MHz.  When I tried running a CH552T at 24MHz with a 3.3V supply, it never worked.  The same part worked perfectly at 16MHz.

Despite the quirks, I think the CH55x MCUs are still a good value.  Current quantity 10 pricing at LCSC is 36c for the CH552T, and 26c for the CH551G.  I recently purchased a small tube of the CH552T, and have plans to test the touch, ADC, PWM, and SPI peripherals.


Writing USB firmware on the CH55x MCUs


Over the last several months, I've been familiarizing myself with the CH552 and CH551 MCUs.  Most recently, I've been learning how to program the USB serial interface engine on these devices.  The USB interface is powerful and flexible enough to implement many different kinds of USB devices, from HID to CDC serial.  The highlights are:

  • support for endpoints 0 through 4, both IN and OUT
  • 64-byte maximum packet size
  • DMA to/from xram only
  • multiple USB interrupt triggers
One of the first requirements for writing USB firmware is writing the descriptors.  The examples from WCH are difficult to use as a template due to the descriptors being uint8_t arrays instead of structures.  There are USB structure and constant definitions in ch554_usb.h, which I recommend using instead of arrays.  For instance, I changed the CDC serial example from:

__code uint8_t DevDesc[] = {0x12,0x01,0x10,0x01,0x02,0x00,0x00,DEFAULT_ENDP0_SIZE,
0x86,0x1a,0x22,0x57,0x00,0x01,0x01,0x02,
0x03,0x01
};

to:
__code USB_DEV_DESCR DevDesc = {
.bLength = 18,
.bDescriptorType = USB_DESCR_TYP_DEVICE,
.bcdUSBH = 0x01, .bcdUSBL = 0x10,
.bDeviceClass = USB_DEV_CLASS_COMMUNIC,
.bDeviceSubClass = 0,
.bDeviceProtocol = 0,
.bMaxPacketSize0 = DEFAULT_ENDP0_SIZE,
.idVendorH = 0x1a, .idVendorL = 0x86,
.idProductH = 0x57, .idProductL = 0x22,
.bcdDeviceH = 0x01, .bcdDeviceL = 0x00,
.iManufacturer = 1, // string descriptors
.iProduct = 2,
.iSerialNumber = 0,
.bNumConfigurations = 1
};

Once the descriptors are written, the code to handle device enumeration is mostly boilerplate and can be copied from one of the examples.  During the firmware development stage, I recommend adding a call to disconnectUSB() near the start of main().  It's a function I added to debug.h which forces the host to re-enumerate the device.  This way I don't have to unplug and re-connect the USB module after flashing new firmware.

Setting up the DMA buffer pointers requires special attention when multiple IN and OUT endpoints are used.  Even though five endpoints are supported, there are only four DMA buffer pointer registers: UEP[0-3]_DMA.  When the bits bUEP4_RX_EN and bUEP4_TX_EN are set in the UEP4_1_MOD SFR, the EP4 OUT buffer is UEP0_DMA + 64, and the EP4 IN buffer is UEP0_DMA + 128.  Endpoints 1-3 have even more complex buffer configurations, with optional double-buffering for IN and OUT using 256 bytes for four buffers starting from the UEPn_DMA pointer.
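
Here's a hedged sketch of that EP0 + EP4 layout (SDCC __xdata __at placement; register and bit names as defined in ch554.h):

#include <stdint.h>
#include <ch554.h>

__xdata __at (0x0000) uint8_t ep0_buf[64];      // EP0 IN/OUT at UEP0_DMA
__xdata __at (0x0040) uint8_t ep4_out_buf[64];  // EP4 OUT = UEP0_DMA + 64
__xdata __at (0x0080) uint8_t ep4_in_buf[64];   // EP4 IN = UEP0_DMA + 128

void usb_buffers_init(void)
{
    UEP0_DMA = (uint16_t)ep0_buf;               // one pointer serves EP0 and EP4
    UEP4_1_MOD |= bUEP4_RX_EN | bUEP4_TX_EN;    // enable EP4 OUT and IN
}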

When I first started writing USB firmware for the CH551 and CH552, I was concerned that it may be difficult to meet the tight timing requirements, particularly for control and bulk transfers, which can have multiple packets in a single 1ms frame.  For example, with small data packets, the time between the end of one OUT transfer and the end of the next OUT transfer can be less than 20uS.  If the USB interrupt handler is too slow, the 2nd OUT transfer could overwrite the DMA buffer before processing of the first has completed.  This situation is avoided by setting bUC_INT_BUSY in the USB_CTRL SFR.  When this bit is set, the SIE will NAK any packets while the UIF_TRANSFER flag is set.  Therefore I recommend setting bUC_INT_BUSY, and clearing UIF_TRANSFER at the end of the interrupt handler.
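
In outline, the resulting interrupt structure looks like this (a sketch; SFR and bit names from ch554.h):

#include <ch554.h>

void usb_init_int(void)
{
    USB_CTRL |= bUC_INT_BUSY;   // SIE NAKs new packets while UIF_TRANSFER is set
}

void usb_isr(void) __interrupt (INT_NO_USB)
{
    if (UIF_TRANSFER) {
        // decode USB_INT_ST and handle the IN/OUT/SETUP token here ...
        UIF_TRANSFER = 0;       // clear last: the SIE resumes ACKing packets
    }
    if (UIF_BUS_RST) {
        // reset endpoint state ...
        UIF_BUS_RST = 0;
    }
}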

I am currently working on the CMSIS_DAP example.  It implements the DAPv1 (HID) protocol supporting SWD transfers, and works well with OpenOCD and pyOCD.  I'm working on adding CDC/ACM for serial UART communication.  The first step is creating the descriptors for the composite CDC + HID device.  The second step will be integrating the usb_device_cdc code.  The final step, although not absolutely necessary, will be optimizing the CDC code for baud rates up to 1Mbps.  The current code uses transmit and receive ring buffers with data copied to and from the IN and OUT DMA buffers.  With double-buffering, the transmit and receive ring buffers can be omitted.  The UART interrupt will copy directly between SBUF and the appropriate USB DMA buffer.




Honey, I shrunk the Arduino core!

One of my gripes about the Arduino AVR core is that it is not an example of efficient embedded programming.  One of the foundations of C++ (PDF) is zero-overhead abstractions, yet the Arduino core has a very significant overhead.  The Arduino basic blink example compiles to almost 1kB, with most of that space taken up by code that is never used.  Rewriting the AVR core is a task I'm not ready to tackle, but after writing picoCore, I realized I could use many of the same optimization techniques in an Arduino library.  The result is ArduinoShrink, a library that can dramatically reduce the compiled size of Arduino projects.  In this post I'll explain some of the techniques I used to achieve the coding trifecta of faster, better, and smaller.

The Arduino core is actually a static library that is linked with the project code.  As Eli explains in this post on static linking, libraries like libc usually have only one function per .o in order to avoid linking in unnecessary code.  The Arduino core doesn't use that kind of modular approach; however, by making use of gcc's "-ffunction-sections" option, it does mitigate the amount of code bloat due to the non-modular approach.

With ArduinoShrink, I wrote more modular, self-contained code.  For example, the Arduino delay() function calls micros(), which relies on the 32-bit timer0 interrupt overflow counter.  I simplified the delay function so that it only needs the 8-bit timer value.  If the user code never calls micros() or millis(), the timer0 ISR code never gets linked in.  By using a more efficient algorithm and writing the code in AVR assembler, I reduced the size of the delay function to 12 instructions taking 24 bytes of flash.

In order to minimize code size and maximize speed, almost half of the code is in AVR assembler.  Despite improvements in compiler optimization techniques over the past decades, on architectures like the AVR I can almost always write better assembler code than what the compiler generates.  That's especially true for interrupt service routines, such as the timer0 interrupt used to maintain the counters for millis() and micros().  My assembler version of the interrupt uses only 56 bytes of flash, and is faster than the Arduino ISR written in C.

One part that is still written in C is the digitalWrite() function.  The Arduino core uses a set of tables in flash to map a given pin number to an IO port and bit, making for a lot of code to have digitalWrite(13, LOW) clear PORTB5.  Making use of Bill's discovery that these flash memory table lookups can be resolved at compile time, digitalWrite(13, LOW) compiles to a single instruction: "cbi PORTB, 5".
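The effect is as if digitalWrite were written as a chain of constant comparisons.  A simplified illustration (assuming an Uno, where pin 13 is PORTB bit 5; the actual library reads the Arduino pin mapping tables rather than hard-coding the pins):

#include <avr/io.h>

static inline void digitalWriteSketch(uint8_t pin, uint8_t val)
{
    // With a compile-time constant pin, the comparison chain folds away
    // and gcc emits a single sbi or cbi instruction.
    if (pin == 13) {
        if (val) PORTB |= (1 << PORTB5);
        else     PORTB &= ~(1 << PORTB5);
    }
    // ... other pins elided ...
}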

ArduinoShrink is also designed to significantly reduce interrupt latency.  The original timer0 interrupt takes around 5us to run, during which time any other interrupts are delayed.  The first instruction in my ISR is 'sei', which allows other interrupts to run, reducing the latency impact to a few cycles more than the hardware minimum.  The official Arduino core disables interrupts in several places, such as when reading the millis counter.  My solution is to detect if the millis counter has been updated and re-read it, thereby avoiding any interrupt latency impact.
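The re-read technique looks something like this (a sketch; timer0_millis stands in for the ISR-updated counter, and the real variable names differ):

#include <stdint.h>

extern volatile uint32_t timer0_millis;   // updated by the timer0 ISR

uint32_t millis_sketch(void)
{
    uint32_t m1, m2;
    do {                       // a 4-byte read isn't atomic on AVR, so
        m1 = timer0_millis;    // read twice and retry if the ISR updated
        m2 = timer0_millis;    // the counter mid-read; no cli/sei needed
    } while (m1 != m2);
    return m1;
}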

The only limitation compared to the official AVR core is that the compiler must be able to resolve the pin number for the digital IO functions at compile time.  Although the pin may be hard-coded, even with LTO enabled, avr-gcc is not always able to recognize that the pin is a compile-time constant.  Since AVR is not a priority target for GCC optimizations, I can't rely on compiler improvements to resolve this limitation.  Therefore I plan to write a version of digitalWrite that is much smaller and faster than the Arduino core's, even when avr-gcc can't determine the pin at compile time.

Although ArduinoShrink should be compatible with any Arduino sketch, given some of the compiler tricks I've used, it's quite possible I've missed a potential error.  If you do find what you think is a bug, open an issue in the github repository.



Pi ethernet gadget with reverse SSH proxy


I love my Pi Zeros.  I think every hacker should have one in their toolbox.  When I got my first Pi Zero several years ago, I used a USB-TTL serial adapter to connect to the console UART on pins 8 and 10 of the Pi header.  Once I learned how to set up the Zero as an ethernet gadget, things were a bit easier.  However, updating software was still a cumbersome process of downloading files to the host computer and then using scp to transfer them to the Pi.  This blog post documents how to set up the Pi to use an SSH reverse proxy so utilities like git and apt work.

When I got my first Pi Zero, I chose the Pi OS Lite image.  I decided to update to the March 4, 2021 release, and this time I used the Pi OS with desktop because it includes development tools like git.  I followed the ethernet gadget setup instructions, modifying config.txt, cmdline.txt, and creating an empty file called "ssh".  The next step is to configure the multicast DNS component of Zeroconf.  As mentioned in the Adafruit instructions, if you are using Windows, the easiest way to do this is installing Apple's Bonjour service.

To use a reverse proxy over ssh, Windows users can't use putty, as it does not support reverse socks5 proxies.  OpenSSH supports them as of version 7.6.  For connecting from Windows, I installed MSYS2, including OpenSSH 8.4.  On Windows 10, WSL is probably the easiest option.  To connect to the Pi and enable a reverse socks5 proxy on port 1080, enter "ssh -R 1080 pi@raspberrypi.local".

Once connected to the Pi, set "http_proxy" to "socks5h://localhost:1080".  The "h" at the end is important as it means the client will do hostname (DNS) resolution through the proxy.  I added the following line to .profile to set it every time I login:

export http_proxy="socks5h://localhost:1080"

Programs such as git and curl will automatically use the socks proxy when the http_proxy environment variable is set.  Note that github defaults to showing https URLs for repositories, which need to be changed to "http://" for the proxy to work, since only http_proxy is set; exporting https_proxy the same way would cover https URLs as well.

The last configuration I recommend is setting the current date, since the Pi does not have a battery-backed RTC.  I normally use ntpdate from the ntp project for manually setting the date and time on Linux, but it does not work with a socks proxy.  After some searching I found a suggestion of using the HTTP Date: field from a reliable internet server.  The command I use is:

date -s "`curl -sI google.com | grep "^Date:" | cut -d' ' -f3-7`"

Once the Pi Zero is configured and has the proper date and time set, I recommend running "apt update".  If everything is working properly, it will use the socks5 reverse proxy to connect to the raspbian servers and update the local apt repository cache.


Fast 1-wire shift register control


One-wire shift register control systems are an old idea, with the benefit of saving an IO pin at the cost of usually much slower speed than standard SPI.  I'm a bit of a speed nut, so I decided to see how fast I could make a 1-wire shift system.

The maximum speed of 1-wire shift control systems is limited by the charge time of the resistor-capacitor network used.  The well-known RC time constant is the resistance in Ohms times the capacitance in Farads, giving the time in seconds to reach 63.2% charge or discharge.  At an arbitrary time t, the fraction of the charge remaining on the capacitor is e^(-t/RC), or (1/e)^x where x is the number of RC time constants elapsed.

In a 1-wire shift system, the RC network must discharge less than 50% in order to transmit a 1, and it must discharge more than 50% in order to transmit a 0.  That 50% threshold is 0.7*R*C.  The hysteresis for the shift register and the system error margin will determine how far from 50% those thresholds must be, and therefore the difference between the low times for transmitting a 1 or 0 bit.  The SN74HC595 datasheet indicates a typical margin of about 0.05*Vcc, so an input high must be more than 0.55*Vcc, and input low must be less than 0.45*Vcc.  After writing some prototype code, and a bunch of math, I settled on an order of magnitude difference between the two.  That means in an ideal setup, a transmitted one discharges the RC network by 16.5%, and a zero discharges the network by 83.5%.  That gives a rather comfortable margin of error, and it does not entail significant speed compromises.
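For reference, here is how those numbers fall out of the e^(-t/RC) discharge law:

e^(-t/RC) = 0.5  ->  t(50%) = RC*ln(2) = 0.693*R*C
e^(-t1/RC) = 0.835  ->  t1 = 0.18*R*C
t0 = 10*t1 = 1.8*R*C  ->  1/e^1.8 = 0.165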

In previous 1-wire shift systems, half of the timing budget is the discharge time, and the other half is the charge time, because after transmitting a bit, the RC network needs to charge back up close to the high level.  In order to eliminate the charge delay, I simply added a diode to the RC network as shown in the schematic.  If you are thinking I forgot the "C" of the RC network, you are mistaken.  Using the knowledge I gained in Parasitic capacitance of AVR MCU pins and Using a 74HC595 as a 74HC164 shift register, I saved a component in my design by using the parasitic capacitance of the circuit.

I'm using a silicon diode that has a capacitance of 4pF.  The total capacitance including the 74HC595, the diode, and the resistor on a breadboard is about 13pF.  A permanent circuit with the components soldered on a PCB would likely be around 10pF.  The circuit is designed for AVRs running at 8-16MHz, so the shortest discharge period for a transmitted zero would be 10 cycles at 16MHz, or 625ns.  With R*C = 330ns, a 625ns discharge would be 1.9*RC, and the discharge fraction would be 1/e^1.9, or 0.1496.  The discharge time for a transmitted one would be 62.5ns, and the fraction would be 1/e^0.189, or 0.8278.  Considering the diode forward voltage drop keeps the circuit from instantly charging to 100%, the optimal resistor value when running at 16MHz would be close to 47K Ohm.  In my testing with a tiny13 running at 9.3Mhz on 3.3V, the circuit worked with as little as 12K Ohm and as much as 110K Ohm.  The "sweet spot" was around 36K Ohm, hence my use of 33K in the schematic above.
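Here's a sketch of the transmit routine in C (pin choice and timing are illustrative; the actual shiftOne function is in the github repo mentioned below).  The line clocks the shift register on its rising edge, while the RC network on the serial data input remembers how long the line was held low:

#include <avr/io.h>
#include <util/delay_basic.h>

#define SHIFT_BIT PB0   // illustrative pin; assumed already an output, idle high

static void shiftOneByte(uint8_t data)
{
    for (uint8_t i = 0; i < 8; i++, data <<= 1) {
        PORTB &= ~(1 << SHIFT_BIT);   // pull the line low
        if (!(data & 0x80))
            _delay_loop_1(3);         // hold low ~10x longer to send a 0
        PORTB |= (1 << SHIFT_BIT);    // rising edge clocks the bit in; the
                                      // diode recharges the network almost
                                      // instantly, so no recovery delay
    }
}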

For debugging with my scope, I needed to account for the probe capacitance of ~12pF, which gives a total capacitance of 25pF.  Here's a screen shot using a 22K Ohm resistor, transmitting 0xAA:


I've posted example code on github.  It uses the shiftOne function for software PWM, creating a LED fade effect on all 8 outputs of the shift register.  I tested it with MicroCore, and the shiftOne function can be copied verbatim and used with avr-gcc.  Since it uses direct port access instead of Arduino's slow digitalWrite, the references to PORTB & PINB will need to be changed in order to use a pin on a different port.

Writing USB firmware on the CH55x MCUs


Over the last several months, I've been familiarizing myself with the CH552 and CH551 MCUs.  Most recently, I've been learning how to program the USB serial interface engine on these devices.  The USB interface is powerful and flexible enough to implement many different kinds of USB devices, from HID to CDC serial.  The highlights are:

  • support for endpoints 0 through 4, both IN and OUT
  • 64-byte maximum packet size
  • DMA to/from xram only
  • multiple USB interrupt triggers
One of the first requirements for writing USB firmware is writing the descriptors.  The examples from WCH are difficult to use as a template because the descriptors are uint8_t arrays instead of structures.  There are USB structure and constant definitions in ch554_usb.h, which I recommend using instead of raw arrays.  For instance, I changed the CDC serial example from:

__code uint8_t DevDesc[] = {0x12,0x01,0x10,0x01,0x02,0x00,0x00,DEFAULT_ENDP0_SIZE,
0x86,0x1a,0x22,0x57,0x00,0x01,0x01,0x02,
0x03,0x01
};

to:
__code USB_DEV_DESCR DevDesc = {
.bLength = 18,
.bDescriptorType = USB_DESCR_TYP_DEVICE,
.bcdUSBH = 0x01, .bcdUSBL = 0x10,
.bDeviceClass = USB_DEV_CLASS_COMMUNIC,
.bDeviceSubClass = 0,
.bDeviceProtocol = 0,
.bMaxPacketSize0 = DEFAULT_ENDP0_SIZE,
.idVendorH = 0x1a, .idVendorL = 0x86,
.idProductH = 0x57, .idProductL = 0x22,
.bcdDeviceH = 0x01, .bcdDeviceL = 0x00,
.iManufacturer = 1, // string descriptors
.iProduct = 2,
.iSerialNumber = 0,
.bNumConfigurations = 1
};

Once the descriptors are written, the code to handle device enumeration is mostly boilerplate and can be copied from one of the examples.  During the firmware development stage, I recommend adding a call to disconnectUSB() near the start of main().  It's a function I added to debug.h which forces the host to re-enumerate the device.  This way I don't have to unplug and re-connect the USB module after flashing new firmware.
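Something like the following would do the job (my sketch, not necessarily the exact version in debug.h; my assumption from the CH552 datasheet is that clearing bUC_DEV_PU_EN releases the internal USB pull-up so the host sees a disconnect):

#include "ch554.h"   // SFR and bit definitions
#include "debug.h"   // mDelaymS()

void disconnectUSB(void)
{
    USB_CTRL &= ~bUC_DEV_PU_EN;   // drop the pull-up: host sees an unplug
    mDelaymS(50);                 // give the host time to notice
    USB_CTRL |= bUC_DEV_PU_EN;    // re-attach, forcing re-enumeration
}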

Setting up the DMA buffer pointers requires special attention when multiple IN and OUT endpoints are used.  Even though five endpoints are supported, there are only four DMA buffer pointer registers: UEP[0-3]_DMA.  When the bits bUEP4_RX_EN and bUEP4_TX_EN are set in the UEP4_1_MOD SFR, the EP4 OUT buffer is UEP0_DMA + 64, and the EP4 IN buffer is UEP0_DMA + 128.  Endpoints 1-3 have even more complex buffer configurations, with optional double-buffering for IN and OUT using 256 bytes for four buffers starting from the UEPn_DMA pointer.

When I first started writing USB firmware for the CH551 and CH552, I was concerned that it may be difficult to meet the tight timing requirements, particularly for control and bulk transfers, where multiple packets can arrive in a single 1ms frame.  For example, with small data packets, the time between the end of one OUT transfer and the end of the next OUT transfer can be less than 20uS.  If the USB interrupt handler is too slow, the 2nd OUT transfer could overwrite the DMA buffer before processing of the first has completed.  This situation is avoided by setting bUC_INT_BUSY in the USB_CTRL SFR.  When this bit is set, the SIE will NAK any packets while the UIF_TRANSFER flag is set.  Therefore I recommend setting bUC_INT_BUSY and clearing UIF_TRANSFER at the end of the interrupt handler.
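Here's a skeleton of the transfer interrupt under this scheme (flag and register names are from ch554.h; the endpoint handling is elided):

#include "ch554.h"

void USBInterrupt(void) __interrupt(INT_NO_USB)
{
    if (UIF_TRANSFER) {
        // dispatch on USB_INT_ST (token type and endpoint number) and
        // service the endpoint's DMA buffer here ...
        UIF_TRANSFER = 0;   // clear last: while this flag is set, the SIE
                            // NAKs new packets, protecting the DMA buffer
    }
    if (UIF_BUS_RST) {
        // ... reset endpoint and device state ...
        UIF_BUS_RST = 0;
    }
}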

I am currently working on the CMSIS_DAP example.  It implements the DAPv1 (HID) protocol supporting SWD transfers, and works well with OpenOCD and pyOCD.  I'm working on adding CDC/ACM for serial UART communication.  The first step is creating the descriptors for the composite CDC + HID device.  The second step will be integrating the usb_device_cdc code.  The final step, although not absolutely necessary, will be optimizing the CDC code for baud rates up to 1Mbps.  The current code uses transmit and receive ring buffers with data copied to and from the IN and OUT DMA buffers.  With double-buffering, the transmit and receive ring buffers can be omitted.  The UART interrupt will copy directly between SBUF and the appropriate USB DMA buffer.




DC Wiring Losses in String and Microinverter Solar PV Arrays


There are two common ways of wiring solar PV arrays.  Each panel can be connected to a microinverter, with each microinverter connected in parallel to an AC bus.  Alternatively, panels can be connected in series, with one or more DC strings connected to an inverter.  Although there is debate over which design is best, at Solar Si, we prefer string inverters.  This is an analysis of DC wiring losses for an array of eight 72-cell LONGi PV modules of about 450 Watts each.

There are two sources of wiring resistance in the array.  The first is from the wire itself, and the second is from the connectors.  The 12 AWG wire used for the panel output cables has a resistance of 5.2 mOhm/m.  The MC4 connectors are specified to have a contact resistance of less than 0.5 mOhm.  While this may be the resistance when tested in a clean and dry factory, test results in warm and humid conditions show much higher resistance.  The results in Reliability Model Development for Photovoltaic Connector Lifetime Prediction Capabilities indicate resistance in the field is likely to be around 2.5 mOhm.

For the string array, the panels are arranged in the portrait configuration, with the inverter situated 1m from the array.  The panels are about 1.06 m wide, making the length of the array 8.5 m.  Each panel has a 20cm and a 40cm negative and positive output cable.  Unlike the 12 AWG wire used for the PV panel output cables, in Canada, field wiring for PV strings is almost always done with 10 AWG RPVU wire.  This has a resistance of 3.28 mOhm/m, and a total of 10.5 m are used for the array.

With 8 panels, there are 7 connections between panels, plus two connections at the ends mating with the RPVU wire.  The DC connections on the inverter are usually not MC4, but for simplicity their resistance is assumed to be the same.  Adding the positive and negative connections to the inverter, the total comes to 11.  Here are the calculations for the total resistance:

10 AWG RPVU: 10.5 m * 3.28 mOhm/m = 34.4 mOhm
12 AWG panel cables: 0.6 m * 8 = 4.8 m, * 5.2 mOhm/m = 25 mOhm
contacts: 11 * 2.5 mOhm = 27.5 mOhm
total: 86.9 mOhm

For the microinverter array, the optional 1.4 m PV panel output cables are needed in order for the cables to reach the corresponding microinverter.  This increases the total length of 12 AWG wire to 22.4 m.  Here are the calculations for the total resistance:

12 AWG panel cables: 2.8 m * 8 = 22.4 m, * 5.2 mOhm/m = 116 mOhm
contacts: 16 * 2.5 mOhm = 40 mOhm
total: 156 mOhm

Although the microinverter configuration has higher resistance losses, they are not significant.  During peak power output, DC current is about 10 Amps.  Using P = I^2 * R, that works out to (10 A)^2 * 156 mOhm = 15.6 W across the 3.6 kW array, or around 0.5%, versus about 0.25% for the string configuration.  Most of the time the array output current is much less than 10 Amps, so the average power loss is much lower.  There are additional losses from the AC bus connectors, which are also not significant.

In conclusion, power losses are higher with microinverters than string inverters, but they are not significant.  The justification for choosing string inverters lies more with the cost savings in material and labor.  For an array with 16 panels, the cost of a 6 kW inverter with 2 string inputs is less than half the cost of 16 Enphase IQ7A microinverters.


KSTAR Single Phase String Inverters


KSTAR New Energy makes single phase grid-tied inverters ranging from 1 kW to 10 kW.  I tested a 3000S, a 5000D, and a 6000D that were produced in KSTAR's factory outside of Shenzhen.  Their single phase inverters are marketed for locations with a 230 V line to neutral (L-N) grid.  They also work with the split phase 240 V line to line grid that is typical in the US and Canada.  They do not have UL 1741 certification, so they would require special engineering approval to be used for permanent installations with most US and Canada power utilities.

Residential inverters used in the US and Canada usually have an attached junction box with terminal connections for DC and AC wiring.  In the rest of the world, inverters usually have MC4 connectors for the DC string input, and a watertight three-pin plug connection for the AC output.  It is much more convenient having the plug connections when testing inverters and PV panels.  It also avoids potential electrical code concerns when DC wiring up to 600 V and 240 Vac are in the same junction box.

 

The KSTAR inverters all included MC4 crimp connectors for terminating the DC strings.  The AC connector will accept SOOW or SJOW cable with an outside diameter of up to 16 mm.  I used 3-wire 12 AWG SOOW cable that is rated for up to 25 Amps.

The 3000S has a single string input, and a "nominal" output power of 3 kW.  It is a light inverter, with a stated weight of 8 kg.  Out of the box, the measured weight was 7.3 kg.  The light weight makes it very easy for a single person to install.  When hooked up to a test string of ten 72-cell panels, the efficiency was 85-86%.  This is much lower than the spec efficiency of 97%, or the 96% efficiency at the nominal 380 V listed on the inspection and test sheet that was included with the inverter.  With input power of 3070 W and input voltage of 367.7 V, the output power was 2620 W, for an efficiency of 85.3%.  KSTAR sales and engineering were unable to explain the low efficiency.


The 5000D and 6000D have the same external dimensions and connections on the bottom.  The weight of the 5000D is 11.74 kg, while the 6000D weighs 12.48 kg.   This suggests the 6000D has different internal circuitry, likely larger inductors and capacitors, to support the higher power rating.

The efficiency of the 5000D and 6000D inverters ranged between 89 and 91%.  The screenshot of monitor data below shows a total input power of 6240 W with AC output power of 5570 W, for an efficiency of 89.3%.  This test was done with a large difference between the PV1 and PV2 voltages to represent typical residential PV installations, which are not optimized for the inverter's nominal 380 V string voltage.


The KSTAR inverters are reasonably priced and easy to install, but the low efficiency makes them unattractive compared to Growatt and Ginlong Solis inverters.
