Stretch
Documentation
Usage
Papers
- A Reconfigurable Hardware Application for Machining of Metal Parts
- Acceleration of a 3D Target Tracking Algorithm Using an Application Specific Instruction Set Processor
- Processor Customization for Wearable Bio-monitoring Platforms
Setup and Configuration
Starting Stretch
- Set up the environment.
- Make sure you are using a bash shell: just type "bash"
source /usr/local/bin/Stretch_src
- Copy the example directory to a location of your choice
cp -r /usr/local/Stretch/Examples <where you want>
- Start the IDE
st-ide &
Compiling
When you compile your StretchC file (extension .xc), the compiler automatically creates a header file that ***must be included in your C code.*** The header file name is set in Project -> Project Properties... -> Compiler Options tab -> EI Header.
In summary:
1) Set the header file name.
2) Include it in the C code: #include "givenName.h"
3) Compile your StretchC file.
4) Run the StretchC code by hitting debug.
Connecting to the board
Set up a connection type:
Board type: S5530, S55DB-ddr
Board IP address: 192.168.1.135
PC IP address:
Board MAC address: 00:1A:3B:3C:D0:B3 (set in code)
PC MAC address: 00:1b:21:23:33:54
To use IP/UDP with the Stretch board, a static ARP entry must be defined in the ARP table.
Check the ARP table in Linux with the command
/sbin/arp
Devices
Descriptions of each device and how they operate along with source code can be placed here.
Ethernet over PPI
In the example directory for Stretch, there is a MAC loopback example under ../Examples/system/ppi/. Open it and turn off loopback mode by changing the call ppi_init(mode,1) to ppi_init(mode,0).
Further down you will see comments about scheduling a receive and scheduling a transmission. The buffers tx_buf and rx_buf are used.
To send data from the computer, send it to the board's IP address; in Stretch C the data should then appear in rx_buf. rx_buf will contain the Ethernet layer packet, starting with the destination and source MAC addresses, then the IP header, then the UDP header, followed by the payload.
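The layout described above can be walked in plain C to find the payload. This is a minimal sketch, not code from the ppi example: the function name udp_payload is hypothetical, and it assumes rx_buf starts at the destination MAC of an untagged Ethernet II frame carrying IPv4 (no VLAN tag, no extra PPI-specific prefix).

```c
#include <stddef.h>
#include <stdint.h>

enum {
    ETH_HEADER_LEN = 14, /* destination MAC (6) + source MAC (6) + EtherType (2) */
    UDP_HEADER_LEN = 8   /* source port, destination port, length, checksum */
};

/*
 * Hypothetical helper: returns a pointer to the UDP payload inside an
 * untagged Ethernet/IPv4 frame and stores its length in *payload_len.
 * Returns NULL if the frame is too short or is not IPv4/UDP.
 */
const uint8_t *udp_payload(const uint8_t *frame, size_t frame_len,
                           size_t *payload_len)
{
    if (frame_len < ETH_HEADER_LEN + 20 + UDP_HEADER_LEN)
        return NULL;

    /* EtherType 0x0800 identifies IPv4. */
    if (frame[12] != 0x08 || frame[13] != 0x00)
        return NULL;

    const uint8_t *ip = frame + ETH_HEADER_LEN;
    if ((ip[0] >> 4) != 4)
        return NULL;

    /* The low nibble is the header length in 32-bit words (options allowed). */
    size_t ihl = (size_t)(ip[0] & 0x0F) * 4;
    if (ihl < 20 || ip[9] != 17) /* protocol 17 = UDP */
        return NULL;
    if (frame_len < ETH_HEADER_LEN + ihl + UDP_HEADER_LEN)
        return NULL;

    const uint8_t *udp = ip + ihl;
    size_t udp_len = ((size_t)udp[4] << 8) | udp[5]; /* header + payload */
    if (udp_len < UDP_HEADER_LEN || ETH_HEADER_LEN + ihl + udp_len > frame_len)
        return NULL;

    *payload_len = udp_len - UDP_HEADER_LEN;
    return udp + UDP_HEADER_LEN;
}
```

With a 20-byte IP header and no options, the payload starts at byte offset 42 of the frame.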
To send data to the PC, you will have to construct a similar packet in tx_buf.
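Constructing such a packet could look roughly like the following plain C sketch. The names build_udp_frame, ipv4_checksum and put16 are hypothetical and do not come from the Stretch examples. The sketch assumes an IPv4 header without options, leaves the UDP checksum at 0 (allowed for UDP over IPv4), and does not pad short frames or append an Ethernet FCS; whether the PPI/MAC hardware handles those is not covered here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One's-complement checksum over the IPv4 header. */
static uint16_t ipv4_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Store a 16-bit value in network (big-endian) byte order. */
static void put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xFF);
}

/*
 * Hypothetical helper: writes Ethernet + IPv4 + UDP headers and the payload
 * into out (for example tx_buf). Returns the frame length, or 0 on error.
 */
size_t build_udp_frame(uint8_t *out, size_t out_cap,
                       const uint8_t dst_mac[6], const uint8_t src_mac[6],
                       const uint8_t src_ip[4], const uint8_t dst_ip[4],
                       uint16_t src_port, uint16_t dst_port,
                       const uint8_t *payload, size_t payload_len)
{
    size_t total = 14 + 20 + 8 + payload_len;
    if (payload_len > 1472 || total > out_cap) /* 1500-byte MTU limit */
        return 0;

    /* Ethernet header: destination MAC, source MAC, EtherType IPv4. */
    memcpy(out, dst_mac, 6);
    memcpy(out + 6, src_mac, 6);
    put16(out + 12, 0x0800);

    /* IPv4 header, 20 bytes, no options. */
    uint8_t *ip = out + 14;
    memset(ip, 0, 20);
    ip[0] = 0x45;                                    /* version 4, IHL 5 */
    put16(ip + 2, (uint16_t)(20 + 8 + payload_len)); /* total length */
    ip[8] = 64;                                      /* TTL */
    ip[9] = 17;                                      /* protocol: UDP */
    memcpy(ip + 12, src_ip, 4);
    memcpy(ip + 16, dst_ip, 4);
    put16(ip + 10, ipv4_checksum(ip, 20));

    /* UDP header. */
    uint8_t *udp = ip + 20;
    put16(udp, src_port);
    put16(udp + 2, dst_port);
    put16(udp + 4, (uint16_t)(8 + payload_len));
    put16(udp + 6, 0); /* checksum not computed */
    if (payload_len > 0)
        memcpy(udp + 8, payload, payload_len);
    return total;
}
```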
To send the data from the host computer to the Stretch board we used Stretch_sendpic.c, which was compiled using gcc. The syntax for using this program is ..... This allowed us to send an image from the desktop computer to the Stretch board.
For the Stretch project, we modified the ppiExample program so that it would receive an image, buffer it, and retransmit that image: Stretch_ppiExample.c.
Example
We tried to send an image to the Stretch board, copy that image from the receive buffer, rx_buf, into a buffer in main memory and then, once the entire image had been transmitted to the board, copy it from the main memory buffer back into tx_buf and transmit it to the host computer. The results can be seen below.
Since Wikimedia could not display the original .ppm files, they have been linked here. The images seen on the left are the .ppm images converted to the .png filetype.
Lena512.ppm
Cacheflushed.ppm
Noflush.ppm
Even after adding the cache flush line to the source code, the final image does not quite match the original, even though no algorithm had yet been applied to the image on the Stretch board. Every packet sent by the host is received by the Stretch board and no extra packets arrive, and the same holds when the image is transmitted from the Stretch board back to the host computer. This leads us to believe that the problem is in the handling of the main memory buffer.
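One way to narrow this down is to compare the returned pixel data against the original byte by byte and note where the first difference appears. The helper below is a generic plain C sketch (count_mismatches is a hypothetical name, not part of Stretch_ppiExample.c); it assumes both buffers hold the same number of raw pixel bytes with any .ppm header already skipped.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: counts the bytes that differ between two buffers of
 * equal length. *first_diff receives the offset of the first difference,
 * or len if the buffers are identical.
 */
size_t count_mismatches(const uint8_t *a, const uint8_t *b, size_t len,
                        size_t *first_diff)
{
    size_t count = 0;
    *first_diff = len;
    for (size_t i = 0; i < len; i++) {
        if (a[i] != b[i]) {
            if (count == 0)
                *first_diff = i;
            count++;
        }
    }
    return count;
}
```

If the first mismatch offset lines up with a packet or cache-line boundary, that may hint at which copy step corrupts the data.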
UART
Stretch Instructions
The Stretch instructions make this platform unique, but they require a special format....
Requirements
Limitations
- Loops: must be completely unrollable at compile time
- 4096 Arithmetic Unit (AU) bits, used for arithmetic and logic operators
- 8192 Multiply Unit (MU) bits, used for multiplication and shifting
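As an illustration of the loop limitation, a loop whose trip count is a compile-time constant can be fully unrolled, whereas one bounded by a run-time value cannot. The sketch below is ordinary C rather than a Stretch instruction; NUM_TAPS and sum_fixed are hypothetical names, and the argument types a real SE_FUNC would use are not shown.

```c
#include <stdint.h>

#define NUM_TAPS 8 /* compile-time constant trip count */

/* The loop bound is a constant, so a compiler can unroll all 8 iterations. */
uint32_t sum_fixed(const uint8_t v[NUM_TAPS])
{
    uint32_t acc = 0;
    for (int i = 0; i < NUM_TAPS; i++)
        acc += v[i];
    return acc;
}
```

A version taking the element count as a function parameter would not meet the requirement, because the bound is only known at run time.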
Calculating AU and MU usage
C Operators | AU | MU |
---|---|---|
A * B | 0 | |A| * |B| |
A (+, -) B | Max(|A|, |B|) | 0 |
A (<<, >>) B | 0 | |A| * 2|B| |
A (<<, >>) constant | 0 | 0 |
A (<, <=, >, >=, ==, !=) B | Max(|A|, |B|) | 0 |
A (&, ^, |) B | Max(|A|, |B|) | 0 |
A (&&, ||) B | |A| + |B| | 0 |
A (++, --) | |A| | 0 |
cond ? B : C | Max(|B|, |C|) | 0 |
cond ? B + C : B - C | Max(|B|, |C|) | 0 |
cond ? B(±)C : B+const | Max(|B|, |C|, |const|) | 0 |
cond? B+const1: B+const2 | Max(|B|, |const1|, |const2|) | 0 |
TABLE[X] | (2n-1 * n * m)/3 | 0 |
A (const) constant bit extract | 0 | 0 |
A (const0, const1) constant bit-range extract | 0 | 0 |
A (x) variable bit extract | 0 | |A| * |A| |
A (x, y) variable bit-range extract | 0 | |A| * |A| |
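A few rows of the table can be turned into a small estimator to check a design against the 4096 AU and 8192 MU limits before compiling. This is a rough sketch with hypothetical names (Cost, cost_add, cost_mul, cost_sum, fits); it assumes |A| means the bit width of A, covers only the addition and multiplication rows, and simply sums the costs of the operations, which may not match what the Stretch compiler reports.

```c
enum { AU_BUDGET = 4096, MU_BUDGET = 8192 };

/* Estimated resource usage in bits. */
typedef struct {
    unsigned au;
    unsigned mu;
} Cost;

static unsigned max_u(unsigned a, unsigned b)
{
    return a > b ? a : b;
}

/* A (+, -) B: AU = Max(|A|, |B|), MU = 0 */
Cost cost_add(unsigned a_bits, unsigned b_bits)
{
    Cost c = { max_u(a_bits, b_bits), 0 };
    return c;
}

/* A * B: AU = 0, MU = |A| * |B| */
Cost cost_mul(unsigned a_bits, unsigned b_bits)
{
    Cost c = { 0, a_bits * b_bits };
    return c;
}

/* Total of two estimates. */
Cost cost_sum(Cost x, Cost y)
{
    Cost c = { x.au + y.au, x.mu + y.mu };
    return c;
}

/* Nonzero if the estimate stays within both budgets. */
int fits(Cost c)
{
    return c.au <= AU_BUDGET && c.mu <= MU_BUDGET;
}
```

For example, a 16-bit by 16-bit multiply followed by a 32-bit add would be estimated at 256 MU bits and 32 AU bits, well within both limits, while a single 128-bit by 128-bit multiply (16384 MU bits) would not fit.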
Syntax
Single instruction
Syntax:
SE_FUNC void INSTR_NAME(<arguments>) {...}
Example:
SE_FUNC void SimilarFuncs(SE_INST F1, SE_INST F2, <arguments>) {
    ...                 // lots of shared code - reused resources
    x = F1 ? a : b;     // minor difference between F1 and F2
    ...                 // more shared code - reused resources
}
Multiple instructions
Syntax:
SE_FUNC void func_name(SE_INST INSTR_NAME1, SE_INST INSTR_NAME2, ... <arguments>) {...}
Example:
SE_FUNC void DisjointFuncs(SE_INST F1, SE_INST F2, <arguments>) {
    if (F1) { ... }     // some code here - maybe some reused resources
    else { ... }        // very different code here
}
Example
Sobel Project
Here is the optimized code that I used in the Stretch IDE to implement the Sobel algorithm. The Stretch instruction takes in 32 bytes of data and outputs 4 processed pixels. When processing a 128*128 version of the Lena image, the function that implements the Sobel algorithm, "detectEdges", takes 7633 cycles. Compared to the unoptimized code, where "detectEdges" took 14888 cycles, this is a 48.73% reduction. Here are the resources used by the Stretch instruction:
Arithmetic bits.................720
Logic bits........................0
Mux bits.........................80
Register bits.....................0
Pipeline bits....................32
AU total..........................832 out of 4096
Multiply bits.....................0
MU total............................0 out of 8192
Extension registers.................0 out of 4096
I would expect both AUs and MUs to be used, since both addition and multiplication are present in the Stretch instruction. I am a little confused about why only AUs are used.
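For reference, the per-pixel Sobel computation can be written in plain C as below. This is a generic sketch, not the contents of Edge_Detection_Opt.xc: the name sobel_pixel, the neighbour ordering, the |Gx| + |Gy| magnitude and the clamp to 255 are all assumptions. It takes 8 bytes per output pixel (the 3*3 window without its centre), which would be consistent with 32 bytes in and 4 pixels out. Because the Sobel weights are only 1 and 2, the sketch uses additions, subtractions and constant shifts; in the AU/MU table those rows cost AU bits or nothing, which might be related to the MU total of 0, though that is a guess.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper. n[] holds the 8 neighbours of the pixel in row-major
 * order with the centre omitted (assumed layout):
 *
 *   n[0] n[1] n[2]
 *   n[3]  .   n[4]
 *   n[5] n[6] n[7]
 */
uint8_t sobel_pixel(const uint8_t n[8])
{
    /* Horizontal gradient: right column minus left column, weights 1-2-1. */
    int gx = (n[2] + (n[4] << 1) + n[7]) - (n[0] + (n[3] << 1) + n[5]);
    /* Vertical gradient: bottom row minus top row, weights 1-2-1. */
    int gy = (n[5] + (n[6] << 1) + n[7]) - (n[0] + (n[1] << 1) + n[2]);

    /* Approximate gradient magnitude, clamped to the 8-bit pixel range. */
    int mag = abs(gx) + abs(gy);
    return (uint8_t)(mag > 255 ? 255 : mag);
}
```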
One issue I found is that, when processing an image larger than 256*256, the following exception occurs: *Warning* Unhandled user exception: LoadStoreTLBMultiHitCause. This has something to do with the "data" array used in the detectEdges function. When the array size is doubled, from char data[((rowNum-2)*(colNum-2)*8)]; to char data[((rowNum-2)*(colNum-2)*16)];, the exception no longer occurs. This issue does not occur when the code is compiled with gcc (without the Stretch instruction) and executed.
The following table shows the % reduction in execution time of the "detectEdges" function as more data is passed into the Stretch instruction. The image is 128*128.
Bytes Passed into Stretch Instruction | Pixels Processed per Instruction Call | Cycles per Function Call | % Reduction |
---|---|---|---|
0 | 0 | 14889 | 0 |
8 | 1 | 8269 | 44.46% |
16 | 2 | 7808 | 47.56% |
32 | 4 | 7633 | 48.73% |
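The % Reduction column follows from the cycle counts as (baseline - optimized) / baseline * 100, with the 0-byte row as the baseline. A small C helper (percent_reduction is a hypothetical name) makes the arithmetic explicit:

```c
/* Percentage by which `optimized` cycles improve on `baseline` cycles. */
double percent_reduction(unsigned long baseline, unsigned long optimized)
{
    if (baseline == 0)
        return 0.0; /* avoid division by zero */
    return 100.0 * ((double)baseline - (double)optimized) / (double)baseline;
}
```

For the 32-byte row, (14889 - 7633) / 14889 * 100 is approximately 48.73.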
Also, large images like the 512*512 Lena image shown below take a while to execute (about 20 minutes) in the Stretch IDE, so you will want to test this code with a smaller image, such as 128*128 or smaller.
Using Edge_Detection_Opt.xc and Sobel_Edge_Det.c the results can be seen below.