EE219 project

项目的目的是什么

核心目标是设计自定义指令，实现上层 AI 应用与硬件之间的高效交互。

软件部分

基础实现
即便完全不使用自定义指令，也可以完成推理任务，例如运行经典的 LeNet 模型。然而，这种方式会消耗更多的时间周期，性能较低。
优化实现
使用自定义指令能够显著加速运算，例如指导硬件模块加速矩阵计算的核心任务。

具体步骤包括：

设计自定义指令集，用以支持矩阵运算、卷积等关键功能。
将这些自定义指令通过内联汇编封装成函数接口。
利用这些接口，将原生的 C 语言实现改写为支持自定义指令的版本，从而充分利用硬件加速能力。

硬件部分

为了支持自定义指令，我们需要在硬件中完成以下模块的实现：

卷积层：处理输入特征图和卷积核的矩阵运算。
ReLU 层：激活函数模块，用于非线性变换。
全连接层：实现网络的分类或回归任务。

这些模块共同组成了支持自定义指令的硬件计算框架。

构建流程

当我们运行 make run CFILE=xxx 命令时，实际发生了以下几步操作：

构建处理器（build）
使用 Verilator 将 Verilog 描述的处理器转化为 C++ 仿真模型，同时链接 C++ 激励文件以生成可执行仿真程序。
生成参考结果（model）
使用 Python 脚本运行软件模型，生成用于对比的黄金标准结果。
编译 C 程序（compile）
借助 AM 框架，将 C 源代码编译为汇编指令，并提取出指令二进制文件以供硬件仿真使用。

最后，仿真程序会加载指定的参数运行，并验证仿真结果是否符合预期。

方法论

model.h 文件中没有任何注释，这给调试带来了极大的困扰——一不留神就可能搞不清楚权重参数到底存储在哪里。

原始软件实现

在 model.py 中，我们依次完成了以下任务：

加载预训练参数
包括每一层的权重参数（weights）、输入量化比例（input_scale）以及输出量化比例（output_scale）。
运行推理并保存中间结果
使用加载的模型对 CIFAR 数据集中的一张图片进行推理，同时在推理过程中将每一层的中间计算结果保存为 .npy 文件（如 xxx.npy），便于后续分析。
导出量化权重和缩放因子
遍历每一层的参数，将权重量化到 8-bit，并将缩放因子调整为 2 的幂，最终将这些数据保存为二进制文件（xxx.bin），供硬件加载使用。

硬件仿真测试

为了在 C 程序中实现与 model.py 等价的功能，我们需要解决一些关键问题。

问题：如何在 C 程序中获取权重参数？

由于 AM 框架不支持直接从文件中读取权重参数，这需要通过以下方法解决：

AM 框架的内存访问依赖于 DPI-C 接口实现。这意味着权重参数需要预先加载到一个模拟框架提供的 C 数组中。
xxx.bin 文件中的数据已加载到这个 C 数组中，数据的基地址由宏 ADDR_DATA 定义。
文件 model.h 提供了每一层参数起始地址的宏定义，通过这些宏可以准确定位各层的权重和偏置数据。

加速！

通过定义自定义指令并结合硬件支持，我们可以高效完成推理任务，从而显著提升模型的运行速度。

Now I need to write C code to achieve the following job:
The program will build a neutral framework as this:

        self.conv1  = nn.Conv2d(3, 12, 5, bias=False)
        self.pool   = nn.MaxPool2d(2, 2)
        self.conv2  = nn.Conv2d(12, 32, 3, bias=False)
        self.fc1    = nn.Linear(32 * 6 * 6, 256, bias=False)
        self.fc2    = nn.Linear(256, 64, bias=False)
        self.fc3    = nn.Linear(64, 10, bias=True)

You can read weigh of all layer by accessing the memory, the offset of the start index of each index is listed in model.h which will be presented as follows

#define ADDR_DATA           0x80800000
#define ADDR_SAVE           0x80f00000

#define INPUT_INT8_CONV1    3*32*32
#define WEIGHT_INT8_CONV1   12*3*5*5
#define SCALE_INT8_CONV1    1
#define WEIGHT_INT8_CONV2   32*12*3*3
#define SCALE_INT8_CONV2    1
#define WEIGHT_INT8_FC1     32*6*6*256
#define SCALE_INT8_FC1      1
#define WEIGHT_INT8_FC2     256*64
#define SCALE_INT8_FC2      1
#define WEIGHT_INT8_FC3     64*10
#define BIAS_INT16_FC3      10*2
#define SCALE_INT8_FC3      1

#define OUTPUT_INT8_CONV1   12*28*28
#define OUTPUT_INT8_POOL1   12*14*14
#define OUTPUT_INT8_CONV2   32*12*12
#define OUTPUT_INT8_POOL2   32*6*6
#define OUTPUT_INT8_FC1     256
#define OUTPUT_INT8_FC2     64
#define OUTPUT_INT8_FC3     10

#define ADDR_INPUT          ADDR_DATA
#define ADDR_WCONV1         ADDR_INPUT  + INPUT_INT8_CONV1
#define ADDR_SCONV1         ADDR_WCONV1 + WEIGHT_INT8_CONV1
#define ADDR_WCONV2         ADDR_SCONV1 + SCALE_INT8_CONV1
#define ADDR_SCONV2         ADDR_WCONV2 + WEIGHT_INT8_CONV2
#define ADDR_WFC1           ADDR_SCONV2 + SCALE_INT8_CONV2
#define ADDR_SFC1           ADDR_WFC1   + WEIGHT_INT8_FC1
#define ADDR_WFC2           ADDR_SFC1   + SCALE_INT8_FC1
#define ADDR_SFC2           ADDR_WFC2   + WEIGHT_INT8_FC2
#define ADDR_WFC3           ADDR_SFC2   + SCALE_INT8_FC2
#define ADDR_BFC3           ADDR_WFC3   + WEIGHT_INT8_FC3
#define ADDR_SFC3           ADDR_BFC3   + BIAS_INT16_FC3


#define ADDR_OUTCONV1       ADDR_SAVE   
#define ADDR_OUTPOOL1       ADDR_OUTCONV1 + OUTPUT_INT8_CONV1
#define ADDR_OUTCONV2       ADDR_OUTPOOL1 + OUTPUT_INT8_POOL1
#define ADDR_OUTPOOL2       ADDR_OUTCONV2 + OUTPUT_INT8_CONV2
#define ADDR_OUTFC1         ADDR_OUTPOOL2 + OUTPUT_INT8_POOL2
#define ADDR_OUTFC2         ADDR_OUTFC1   + OUTPUT_INT8_FC1
#define ADDR_OUTFC3         ADDR_OUTFC2   + OUTPUT_INT8_FC2

#endif 

The program can get the input data from ADDR_INPUT, and perform inference process which includes the layers aforementioned whose weight can be accessed using macro in mode.h as listed above.

Finally, the C code will be run on a bare machine, which only provides limited runtime library, which listed as follows:
// string.h
void  *memset    (void *s, int c, size_t n);
void  *memcpy    (void *dst, const void *src, size_t n);
void  *memmove   (void *dst, const void *src, size_t n);
int    memcmp    (const void *s1, const void *s2, size_t n);
size_t strlen    (const char *s);
char  *strcat    (char *dst, const char *src);
char  *strcpy    (char *dst, const char *src);
char  *strncpy   (char *dst, const char *src, size_t n);
int    strcmp    (const char *s1, const char *s2);
int    strncmp   (const char *s1, const char *s2, size_t n);

// stdlib.h
void   srand     (unsigned int seed);
int    rand      (void);
void  *malloc    (size_t size);
void   free      (void *ptr);
int    abs       (int x);
int    atoi      (const char *nptr);

// stdio.h
int    printf_s    (const char *format, ...);
int    sprintf_s   (char *str, const char *format, ...);
int    snprintf_s  (char *str, size_t size, const char *format, ...);
int    vsprintf_s (char *str, const char *format, va_list ap);
int    vsnprintf_s (char *str, size_t size, const char *format, va_list ap);

// printf.h
void _putchar(char character);
#define printf printf_
int printf_(const char* format, ...);
#define sprintf sprintf_
int sprintf_(char* buffer, const char* format, ...);
#define snprintf  snprintf_
#define vsnprintf vsnprintf_
int  snprintf_(char* buffer, size_t count, const char* format, ...);
int vsnprintf_(char* buffer, size_t count, const char* format, va_list va);
#define vprintf vprintf_
int vprintf_(const char* format, va_list va);
int fctprintf(void (*out)(char character, void* arg), void* arg, const char* format, ...);


Help me write c code that perform the above task, and rember add some printf statement for better debugging. If you are confused about anything, please remind me to give you more infomation.
and remember to convert constant to uintptr_t first to avoid type mismatch.

#include "trap.h"
#include "model.h"

/*--------------------------------------------------------------------------------------*
 * Helper functions for memory access. 
 * Always cast the #define addresses to (uintptr_t) first to avoid type mismatch.
 *--------------------------------------------------------------------------------------*/

/**
 * Simple naive conv2d for int8 input/weights, int8 output
 *   in_w, in_h: width & height of the input feature map
 *   kernel_size: e.g. 3 or 5
 *   in_ch, out_ch: input & output channels
 *   stride: typically 1
 *   p_in, p_w, p_out: pointers to input, weight, output memory
 */
static void conv2d_naive_int8(
    const int8_t *p_in,        /* input tensor */
    const int8_t *p_w,         /* weight tensor */
    int8_t       *p_out,       /* output tensor */
    int          in_w,
    int          in_h,
    int          in_ch,
    int          out_ch,
    int          kernel_size,
    int          stride
)
{
    int out_w = (in_w - kernel_size) / stride + 1;
    int out_h = (in_h - kernel_size) / stride + 1;

    // For each output channel
    for(int oc = 0; oc < out_ch; oc++)
    {
        // For each output y/x
        for(int oy = 0; oy < out_h; oy++)
        {
            for(int ox = 0; ox < out_w; ox++)
            {
                // Compute convolution
                int32_t acc = 0;  // 32-bit accumulator
                // Weighted sum
                for(int ic = 0; ic < in_ch; ic++)
                {
                    for(int ky = 0; ky < kernel_size; ky++)
                    {
                        for(int kx = 0; kx < kernel_size; kx++)
                        {
                            int in_x = ox * stride + kx;
                            int in_y = oy * stride + ky;

                            int8_t val_in = p_in[ic * (in_w*in_h) + (in_y * in_w) + in_x];
                            int8_t val_w  = p_w[oc * (in_ch * kernel_size * kernel_size)
                                             + ic * (kernel_size * kernel_size)
                                             + ky * kernel_size + kx];
                            acc += (int32_t)val_in * (int32_t)val_w;
                        }
                    }
                }
                // Naive scale or shift if needed. For now, just clamp to int8:
                // In a real scenario, you'd do: acc = (acc * scale) >> shift, or something similar.
                if(acc > 127)  acc = 127;
                if(acc < -128) acc = -128;

                p_out[oc * (out_w*out_h) + oy * out_w + ox] = (int8_t)acc;
            }
        }
    }
}

/**
 * Simple MaxPool2D with kernel=2, stride=2, no overlap, int8
 *   in_ch: # input channels
 *   in_w, in_h: input width & height
 *   out_w, out_h: output width & height
 */
static void maxpool2d_2x2_stride2_int8(
    const int8_t *p_in,
    int8_t       *p_out,
    int          in_ch,
    int          in_w,
    int          in_h
)
{
    // With 2x2, stride=2, output dims are in_w/2, in_h/2 (assuming in_w, in_h are even)
    int out_w = in_w / 2;
    int out_h = in_h / 2;

    for(int c = 0; c < in_ch; c++)
    {
        for(int oy = 0; oy < out_h; oy++)
        {
            for(int ox = 0; ox < out_w; ox++)
            {
                int8_t max_val = -128;
                for(int ky = 0; ky < 2; ky++)
                {
                    for(int kx = 0; kx < 2; kx++)
                    {
                        int in_x = ox*2 + kx;
                        int in_y = oy*2 + ky;
                        int8_t val = p_in[c*(in_w*in_h) + in_y*in_w + in_x];
                        if(val > max_val)
                            max_val = val;
                    }
                }
                p_out[c*(out_w*out_h) + oy*out_w + ox] = max_val;
            }
        }
    }
}

/**
 * Fully-connected (linear) layer with int8 input, int8 weight, optional int16 bias, int8 output.
 *   in_features, out_features: e.g., for FC1: in_features=32*6*6=1152, out_features=256
 *   p_in, p_w, p_b, p_out: pointers to input, weight, bias, output memory
 *   has_bias: 1 if bias is present, 0 if no bias
 */
static void fc_naive_int8(
    const int8_t *p_in,
    const int8_t *p_w,
    const int16_t *p_b,
    int8_t       *p_out,
    int          in_features,
    int          out_features,
    int          has_bias
)
{
    for(int o = 0; o < out_features; o++)
    {
        int32_t acc = 0;
        // Dot product
        for(int i = 0; i < in_features; i++)
        {
            int32_t val_in = (int32_t)p_in[i];
            int32_t val_w  = (int32_t)p_w[o*in_features + i];
            acc += val_in * val_w;
        }
        // Add bias if present (note that BIAS_INT16_FC3 says it's 16-bit, so cast properly)
        if(has_bias)
        {
            acc += (int32_t)p_b[o];
        }
        // Naively clamp to int8
        if(acc > 127)  acc = 127;
        if(acc < -128) acc = -128;

        p_out[o] = (int8_t)acc;
    }
}

/*--------------------------------------------------------------------------------------*
 * The main inference function
 *--------------------------------------------------------------------------------------*/
int main(void)
{
    // 1) Convert addresses to pointers
    int8_t  *p_input   = (int8_t  *)(uintptr_t)(ADDR_INPUT);

    int8_t  *p_wconv1  = (int8_t  *)(uintptr_t)(ADDR_WCONV1);
    int8_t  *p_wconv2  = (int8_t  *)(uintptr_t)(ADDR_WCONV2);

    int8_t  *p_wfc1    = (int8_t  *)(uintptr_t)(ADDR_WFC1);
    int8_t  *p_wfc2    = (int8_t  *)(uintptr_t)(ADDR_WFC2);
    int8_t  *p_wfc3    = (int8_t  *)(uintptr_t)(ADDR_WFC3);
    int16_t *p_bfc3    = (int16_t *)(uintptr_t)(ADDR_BFC3);


    int8_t  *p_outconv1= (int8_t  *)(uintptr_t)(ADDR_OUTCONV1);
    int8_t  *p_outpool1= (int8_t  *)(uintptr_t)(ADDR_OUTPOOL1);
    int8_t  *p_outconv2= (int8_t  *)(uintptr_t)(ADDR_OUTCONV2);
    int8_t  *p_outpool2= (int8_t  *)(uintptr_t)(ADDR_OUTPOOL2);
    int8_t  *p_outfc1  = (int8_t  *)(uintptr_t)(ADDR_OUTFC1);
    int8_t  *p_outfc2  = (int8_t  *)(uintptr_t)(ADDR_OUTFC2);
    int8_t  *p_outfc3  = (int8_t  *)(uintptr_t)(ADDR_OUTFC3);

    // Debug prints
    printf("DEBUG: p_input   = 0x%08x\n", (unsigned)(uintptr_t)p_input);
    printf("DEBUG: p_wconv1  = 0x%08x\n", (unsigned)(uintptr_t)p_wconv1);
    printf("DEBUG: p_outconv1= 0x%08x\n", (unsigned)(uintptr_t)p_outconv1);
    // ... add more as needed

    //----------------------------------------------------------------------------------
    // 2) Conv1: in=(3,32,32), out=(12,28,28), kernel=5, stride=1
    //----------------------------------------------------------------------------------
    printf("Running Conv1...\n");
    conv2d_naive_int8(
        /* p_in   */ p_input,
        /* p_w    */ p_wconv1,
        /* p_out  */ p_outconv1,
        /* in_w   */ 32,
        /* in_h   */ 32,
        /* in_ch  */ 3,
        /* out_ch */ 12,
        /* ksize  */ 5,
        /* stride */ 1
    );

    //----------------------------------------------------------------------------------
    // 3) Pool1: MaxPool2D(2,2) => in=(12,28,28), out=(12,14,14)
    //----------------------------------------------------------------------------------
    printf("Running Pool1...\n");
    maxpool2d_2x2_stride2_int8(
        /* p_in   */ p_outconv1,
        /* p_out  */ p_outpool1,
        /* in_ch  */ 12,
        /* in_w   */ 28,
        /* in_h   */ 28
    );

    //----------------------------------------------------------------------------------
    // 4) Conv2: in=(12,14,14), out=(32,12,12), kernel=3, stride=1
    //----------------------------------------------------------------------------------
    printf("Running Conv2...\n");
    conv2d_naive_int8(
        p_outpool1,
        p_wconv2,
        p_outconv2,
        14,
        14,
        12,
        32,
        3,
        1
    );

    //----------------------------------------------------------------------------------
    // 5) Pool2: in=(32,12,12), out=(32,6,6)
    //----------------------------------------------------------------------------------
    printf("Running Pool2...\n");
    maxpool2d_2x2_stride2_int8(
        p_outconv2,
        p_outpool2,
        32,
        12,
        12
    );

    //----------------------------------------------------------------------------------
    // 6) FC1: in=32*6*6=1152, out=256, no bias
    //----------------------------------------------------------------------------------
    printf("Running FC1...\n");
    fc_naive_int8(
        /* p_in        */ p_outpool2,
        /* p_w         */ p_wfc1,
        /* p_b         */ NULL,  // no bias for FC1
        /* p_out       */ p_outfc1,
        /* in_features */ 32*6*6,
        /* out_features*/ 256,
        /* has_bias    */ 0
    );

    //----------------------------------------------------------------------------------
    // 7) FC2: in=256, out=64, no bias
    //----------------------------------------------------------------------------------
    printf("Running FC2...\n");
    fc_naive_int8(
        /* p_in        */ p_outfc1,
        /* p_w         */ p_wfc2,
        /* p_b         */ NULL,  // no bias for FC2
        /* p_out       */ p_outfc2,
        /* in_features */ 256,
        /* out_features*/ 64,
        /* has_bias    */ 0
    );

    //----------------------------------------------------------------------------------
    // 8) FC3: in=64, out=10, with int16 bias
    //----------------------------------------------------------------------------------
    printf("Running FC3...\n");
    fc_naive_int8(
        /* p_in        */ p_outfc2,
        /* p_w         */ p_wfc3,
        /* p_b         */ p_bfc3,
        /* p_out       */ p_outfc3,
        /* in_features */ 64,
        /* out_features*/ 10,
        /* has_bias    */ 1
    );

    //----------------------------------------------------------------------------------
    // Done: p_outfc3 now has the logits for the 10 classes. 
    // Print them out for debugging (optional).
    //----------------------------------------------------------------------------------
    printf("Inference done. Printing final FC3 output:\n");
    for(int i = 0; i < 10; i++)
    {
        printf("Logit[%d] = %d\n", i, (int)p_outfc3[i]);
    }

    // You might want to return an integer code or loop forever in a bare-metal system.
    return 0;
}