Authors: Gusi Te, Xiu Li, Xiao Li, Jinglu Wang, Wei Hu, and Yan Lu
Published at: ECCV 2022
Link: https://arxiv.org/abs/2208.08728
The essence is all in the figure:
The human body representation is the familiar one: a SMPL mesh driven by pose parameters \(\theta\), plus a mesh-guided NeRF whose input is an embedding of the 3D points along each query ray.
What the query embedding consists of:
The paper uses the figure above specifically to emphasize that the K nearest vertices are needed, not just the single nearest vertex. With only the single nearest vertex (figure a), the embedding cannot convey information about different deformation patterns; once the K-NN distances are added in (b), that deformation-pattern information is available. (I am a bit unsure what exactly a deformation pattern is; is it just the local concavity or convexity of the surface?)
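As a minimal sketch of how such a K-NN query embedding could be computed (my own illustration, not the paper's code; the exact feature set in the paper may differ): for each sample point on a ray, take its K nearest SMPL vertices and concatenate the distances and offset vectors as the point's embedding.

```python
import torch

def knn_query_embedding(points, smpl_verts, K=4):
    """For each query point, concatenate distances (and offsets) to its K nearest
    SMPL vertices. points: (N, 3); smpl_verts: (V, 3). Returns (N, K * 4)."""
    d = torch.cdist(points, smpl_verts)             # (N, V) pairwise Euclidean distances
    dists, idx = d.topk(K, dim=1, largest=False)    # K smallest distances per point
    nn_verts = smpl_verts[idx]                      # (N, K, 3) nearest vertices
    offsets = points[:, None, :] - nn_verts         # (N, K, 3) local offset vectors
    # Embedding = per-neighbor distance + offset, flattened over the K neighbors
    emb = torch.cat([dists[..., None], offsets], dim=-1).reshape(points.shape[0], -1)
    return emb
```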
作者:Rolandos Alexandros Potamias, Stylianos Ploumpis, Stylianos Moschoglou, Vasileios Triantafyllou, Stefanos Zafeiriou. From Imperial College London and Cosmos.
Published at: CVPR 2023
Link: https://github.com/rolpotamias/handy
If you were to tackle this task, how would you do it? How does the authors' method differ from what you would expect?
Q: The task sounds fairly intuitive to me: train the appearance with a GAN, define a mesh template with more vertices, and train on a very large number of samples to stack up the quality. Would the definition of the hand model itself contain anything new? I cannot think of anything.
A: It is indeed intuitive, and the hand model definition is not very different. The main contributions are: 1. a large, high-quality, highly varied new dataset, which yields a strong model, Handy; 2. learning the texture with StyleGAN instead of the traditional PCA, which gives textures with more high-frequency detail and better quality.
Questions before reading:
Raw scans: meshes with 3000 vertices, from 1208 subjects, together with metadata about them such as gender, age, height, and ethnicity. The diversity of the subjects is fairly large.
(Useful) egoist / openai-proxy
Spin up a small proxy server on Vercel to forward GPT API calls, which works around the IP restrictions of certain countries and regions.
BuilderIO / ai-shell
Use ChatGPT from the command line: it turns natural language into Linux commands. The command is ai [texts].
eli64s / readme-ai
A lightweight script that generates a fancy README file from a repository.
efJerryYang / chatgpt-cli
A command-line ChatGPT client.
yufeikang / ai-cli
Another command-line ChatGPT client (compare the two when I actually try them).
mukulpatnaik / researchgpt
Feed it a paper PDF and chat with GPT about the paper. It appears to be a web client built with Flask; worth a closer look at how it is implemented, quite interesting.
(⭐️ Amazing) AntonOsika / gpt-engineer
Very easy to install, just pip install! It generates an entire code project from a description, with the AI asking follow-up questions and you filling in details.
(⭐️ Amazing) Yidadaa / ChatGPT-Next-Web
Looks like a very practical web GUI! One-click deployment to Vercel. I mainly looked for this so I can access GPT-4 directly through the API instead of subscribing to ChatGPT Plus every month, which is too expensive for how little I would use it.
"Going back to China": the phrase never meant much to me. When I was in China, I had no real sense of it; only abroad does home take shape as a concept. How ironic. Time back home rushes past. I used to fight for a few more years abroad, and rejoice when I got them. Now that I am actually going to stay away that long, I finally feel that I am sitting on some train speeding off into the distance.
As for having "something to live for", I would like to deconstruct the phrase. We do not live for anything; we just live. And yet I am clearly striving for something, clutching at something. That is just me talking tough; who can truly transcend it all, with nothing they hold dear? Family, a certain life I long for, a sense of accomplishment: that is what I live for. Only this: staying put, it turns out, is a kind of luck, and a kind of privilege.
What I hate most is when half-acquaintances greet you and ask how things are at home. It makes me so angry I could curse right back. But if you did, they would think you were crazy. And how am I supposed to answer? Of my two sons, one has lost his job, and the other's health is so poor.
My life has been bitter. My family was poor when I was little, and nobody looked after me. At thirteen I went out to work to support myself. After I married your grandfather, all we had were hard days. He earned thirty-four yuan and fifty cents a month; twenty went to his parents, without fail, even when we could not afford food at home. Of the remaining fourteen fifty, ten went to me and he kept four fifty. He smoked the cheapest cigarettes, and he left us so early.
I had to manage everything in the family. He took care of nothing: three meals a day, his mother, two sons, washing clothes by the river. And I still had to go to work. I was exhausted every single day.
Finally, just as life got a little better, he was gone! Back then I almost hated him for it.
Don't judge by how I look now; these are actually the easiest days I have ever had. The Communist Party pays me a pension, so I say, thank the Communist Party. I spend my own money and live my own life! It's just that my memory is failing. What is the point of living once you cannot even take care of yourself? I have already agreed with your dad: when the time comes and I die in this room, just cremate me. I have made my peace with it.
Die! The moment I heard the word, tears streamed down my face. Grandma said, let's not talk about that, and wiped her eyes too.
Grandma receives a pension of a bit over three thousand a month. She scrimps and saves and spends barely more than a thousand. The rest sits in savings, and every winter and summer break, when my sister and I visit, she hands it out to us.
I said, Dad, help me give the money back to her.
Dad said, keep it. She cannot spend it all herself; if she does not give it to you two, what use is it to her? Giving it away puts her heart at ease.
When I left, she walked me out of the apartment, all the way to the stairwell.
When I had walked some distance, Dad said, turn around and wave to Grandma again; look, she is watching you from the balcony.
I turned and saw a tuft of white hair, far away, very small.
Can there please be no partings?
What are the pitfalls?
The pitfall here is that the camera itself supports an arbitrary coordinate system; for example, FreiHAND provides camera parameters in a screen coordinate system with a 224*224 image. The renderer, however, assumes NDC (normalized device coordinates) by default, where x and y are normalized to [-1, 1].
At first I passed my camera intrinsics directly to PerspectiveCameras and declared my camera screen to be 224*224, like this:
```python
cameras = PerspectiveCameras(K=ks, image_size=((224, 224),))
```
It runs without any error, but something is wrong: after rendering, nothing shows up in the image.
I finally found this inconspicuous sentence in the official documentation:
The PyTorch3D renderer for both meshes and point clouds assumes that the camera transformed points, meaning the points passed as input to the rasterizer, are in PyTorch3D's NDC space.
Then I saw it: PerspectiveCameras assumes NDC coordinates by default, i.e. in_ndc=True by default!
So the solution is:
Screen space camera parameters are common and for that case the user needs to set in_ndc to False and also provide the image_size=(height, width) of the screen, aka the image.
So adding one argument fixes it. Who would have guessed this issue cost me a full two or three days:
```python
cameras = PerspectiveCameras(K=ks, in_ndc=False, image_size=((224, 224),))
```
I also found the following equivalent approach: first convert the intrinsics into the NDC coordinate system, then pass them to PerspectiveCameras. (Why I ended up exploring this approach is explained in problem 3 below…)
```python
def get_ndc_fcl_prp(Ks):
```
Note focal_length=-fcl. Why the negative sign? That is another pitfall, hahaha.
The answer: PyTorch3D's coordinate convention is different from my camera's. In PyTorch3D, +X points left, +Y points up, and +Z points out from the image plane, so there is an up-down/left-right mirroring relationship involved.
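For reference, a possible body for that conversion might look like the sketch below. This is my own reconstruction under the assumptions in this post (a square 224x224 image, OpenCV-style intrinsics stored in Ks), not an official PyTorch3D helper; the exact sign flips depend on your camera and extrinsics conventions, so double-check against your own data.

```python
import torch
from pytorch3d.renderer import PerspectiveCameras

def get_ndc_fcl_prp(Ks, image_size=224):
    """Hypothetical body: convert screen-space intrinsics (in pixels) to NDC
    focal length / principal point for a square image of side `image_size`."""
    fx, fy = Ks[:, 0, 0], Ks[:, 1, 1]   # focal lengths in pixels
    px, py = Ks[:, 0, 2], Ks[:, 1, 2]   # principal point in pixels
    half = image_size / 2.0
    fcl = torch.stack([fx / half, fy / half], dim=-1)                     # NDC focal length
    prp = torch.stack([-(px - half) / half, -(py - half) / half], dim=-1)  # NDC principal point
    return fcl, prp

# ks: the intrinsics batch used earlier in this post
fcl, prp = get_ndc_fcl_prp(ks)
# The minus sign on the focal length compensates for the +X-left / +Y-up convention above.
cameras = PerspectiveCameras(focal_length=-fcl, principal_point=prp)
```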
cv2 images are BGR (the old chestnut), while PyTorch3D uses RGB. If the yellows and blues in your image look swapped, this is almost certainly the problem, and you need to flip the channel order, e.g. with torch.flip along the channel dimension.
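A minimal example of that channel flip (assuming an H x W x 3 channels-last image tensor; adjust dims if your channels come first):

```python
import torch

img_bgr = torch.rand(224, 224, 3)         # e.g. an image loaded via cv2, channels last
img_rgb = torch.flip(img_bgr, dims=(2,))  # reverse the channel order: BGR -> RGB
```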
Aliasing means the situation in the image below: the object's edges are very sharp and you can see the individual pixels, jagged and grainy!
Below is the result after my anti-aliasing treatment; you can see the edges are much softer:
(I really spent a whole week on this problem… here is my thought process:
First idea: shouldn't blur_radius and faces_per_pixel be turned up? That is the intuitive guess, and even an experienced senior student told me this must be the problem. But when I cranked both parameters way up, nothing changed. blur_radius only blurs the texture inside the object while the jagged edges stay exactly the same, and faces_per_pixel is even less useful, barely affecting the result at all. Next I tried rendering at 1024x1024 and found the jagged edges were indeed much less obvious! But then I looked at the raw images from the camera: a bit blurry, yes, but nowhere near this jagged, so something else had to be wrong.) Finally I found the same problem in this issue: https://github.com/facebookresearch/pytorch3d/issues/399
The solution is:
render at a higher resolution and then use average pooling to reduce back to the target resolution
That brute force?!… Still, the issue thread explains it in detail and it does make sense: this is something you have to handle on top of the rendering principle itself, hardly even a bug.
Code as follows:
```python
import torch.nn.functional as F
```
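For completeness, a minimal sketch of that supersample-then-pool idea; renderer_hires is a stand-in for your existing renderer configured with a larger image size, and the 4x factor is just an example:

```python
import torch.nn.functional as F

scale = 4                                  # supersampling factor
# renderer_hires: your usual MeshRenderer, but with image_size = 224 * scale in its RasterizationSettings
images_hi = renderer_hires(mesh)           # (N, 224*scale, 224*scale, 4) RGBA, channels last
rgb = images_hi[..., :3].permute(0, 3, 1, 2)     # -> (N, 3, H, W) for pooling
rgb_aa = F.avg_pool2d(rgb, kernel_size=scale)    # average-pool back down to 224x224
images_aa = rgb_aa.permute(0, 2, 3, 1)           # back to channels-last if needed
```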
At first the skin rendered like this, looking like porcelain. Seriously:
After the fix it looks like this, much more natural:
It just takes understanding a few material-related parameters. Mainly, this shiny highlight is determined by the following two quantities:
- specular_color: the specular reflectivity of the material, i.e. the color seen where the surface looks glossy and mirror-like.
- shininess: controls how focused the specular highlight is. Values typically range from 0 to 1000; higher values give a tighter, more concentrated highlight.
Note that these are parameters of the object's material. The lighting classes define parameters with the same names, but those describe the light source and have nothing to do with this highlight.
So the fix is simple: define a Materials object and adjust specular_color. The default is (1, 1, 1), pure white; (0.2, 0.2, 0.2) is a better fit for human skin.
```python
from pytorch3d.renderer import Materials
```
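Roughly like this sketch, assuming the device, mesh, and renderer are already set up as in the rest of this post; the shininess value here is just an example:

```python
from pytorch3d.renderer import Materials

materials = Materials(
    device=device,
    specular_color=((0.2, 0.2, 0.2),),  # darker specular suits skin better than the default pure white
    shininess=10.0,                     # lower shininess -> softer, broader highlight
)
# Pass the materials to the renderer call (they can also be given to the shader when it is constructed).
images = renderer(mesh, materials=materials)
```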
Still a hand model, but the render showed only half of the back of the hand; the part farther from the camera looked as if it had been cut off:
After the fix, the correct result should look like this:
So where is the problem? It really is that "the farther parts get clipped". I found a related parameter in RasterizationSettings:
But the problem is not this parameter, because its default value is None, so it should have no effect downstream.
After carefully reading the source code, I found that the problem is in SoftPhongShader… specifically, in shader.py, lines 138-139, inside SoftPhongShader's forward function:
```python
znear = kwargs.get("znear", getattr(cameras, "znear", 1.0))
zfar = kwargs.get("zfar", getattr(cameras, "zfar", 100.0))
```
There is a default z range of [1, 100]…… So the real cause is that my mesh's scale is too large, and with a fairly large camera dist on top of that, the overall depth exceeds zfar. There are two fixes: either shrink the mesh's scale, or, if you do not want to change the original data, pass the znear and zfar arguments explicitly at render time, like this:
```python
images = renderer(mesh, ..., znear=-2.0, zfar=1000.0)
```
The 3D meshes in my data use PBR (physically based rendering) materials, which come with three texture images: a diffuse map, a specular map, and a normal map.
But PyTorch3D currently does not support PBR-inspired shading (see issue).
So for now I can only use the diffuse map as an ordinary texture map and ignore the specular map and the normal map.
I am not sure whether I could implement this part myself, e.g. with a custom phong_shading function (see issue). But that is a bit beyond my ability and the time I have, so it is shelved for now. If it does get implemented, PyTorch3D seems to welcome contributions (issue).
In general, the camera projection matrix P has 11 degrees of freedom: \[P=K[R\ \ \ t]\]
| Component | # DOF | Elements | Known As |
|---|---|---|---|
| K | 5 | \(f_x, f_y, s, p_x, p_y\) | Intrinsic parameters; camera calibration matrix |
| R | 3 | \(\alpha, \beta, \gamma\) | Extrinsic parameters |
| t (or \(\tilde{C}\)) | 3 | \((t_x, t_y, t_z)\) | Extrinsic parameters |
3D world frame ----- R, t ----> 3D camera frame ------ K -----> 2D image
Explanation:
P: Projective camera, maps 3D world points to 2D image points.
K: Camera calibration matrix, 3 x 3, \(x=K[I|0]X_{cam}\), given 3D points in camera coordinate frame \(X_{cam}\), we can project it into 2D points on image \(x\).
R and t: Camera Rotation and Translation, rigid transformation. \(X_{cam}=( X,Y,Z,1)^T\) is expressed in the camera coordinate frame. In general, 3D points are expressed in a different Euclidean coordinate frame, known as the world coordinate frame. The two frames are related via a rigid transformation (R, t).
P: 3x4, homogeneous camera projection matrix; in the simplest case \(P=diag(f,f,1)[I|0]\). This is K without the principal point offset in the image (in other words, it simplifies to \((p_x, p_y)=(0,0)\)).
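Putting the pieces together in the standard textbook form, with \(s\) the skew and \((p_x,p_y)\) the principal point: \[K=\begin{bmatrix} f_x & s & p_x\\ 0 & f_y & p_y\\ 0 & 0 & 1\end{bmatrix},\qquad x = PX = K[R\ \ t]\,X,\] where \(X\) is a homogeneous 3D world point and \(x\) is the homogeneous 2D image point.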
The default GitHub Pages user site is served at [username].github.io. I have a domain jyzhu.top and want to use my custom domain.
Rename the blog repo as blog; rename the academic page repo as [username].github.io.
Edit the blog's Hexo config file:
```yaml
url: https://jyzhu.top/blog
```
Note that there is no need to move all the files into a subfolder blog of your repo.
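For reference, the relevant part of the Hexo _config.yml might then look like the sketch below; the root entry is my assumption about what a subpath deployment usually needs, so adjust it to your own setup.

```yaml
# _config.yml (excerpt)
url: https://jyzhu.top/blog
root: /blog/
```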
The Jekyll config is simple; nothing needs to be specified.
Edit the GitHub repo settings: set the academic repo's custom domain to jyzhu.top. A CNAME file will automatically be added at the root. Now, obviously, jyzhu.top resolves to the academic page.
Then, you know what, everything is done! Every other repo with GitHub Pages turned on is automatically mapped to a subpath of [username].github.io by GitHub. And since [username].github.io is mapped to [url], everything ends up there, including [url]/blog for the blog repo.
The hexo-douban plugin cannot render its styles right now. Needs fixing.
Update: after dealing with the loading.gif referenced in this plugin's index.js, it seems OK now.
I never used to understand why we represent things with homogeneous coordinates, a system different from the Cartesian coordinates we grow up with, and why everything becomes so complicated; the formulas left me confused too. Now I finally get it.
Homogeneous coordinates exist to express what our eyes actually see in the real world: two parallel lines can meet at infinity.
The idea is simply to use N+1 numbers to represent N-dimensional coordinates.
That is, a point \((X,Y)\) in 2D gets one extra dimension and is written as \((x,y,w)\). Converting homogeneous coordinates back to Cartesian is easy: divide the first two components by the last one, i.e. \[X=\frac x w,\ Y=\frac y w\\ (X,Y)=(\frac x w,\frac y w)\]
With this we can express two parallel lines meeting far away! Why?
To explain that, we first need a key property of homogeneous coordinates: scale invariance (which is also why they are called homogeneous). For any nonzero k, \((x,y,w)\) and \((kx,ky,kw)\) represent the same 2D point \((\frac x w,\frac y w)\) (because \(\frac{kx}{kw}=\frac xw\)).
First, using the ordinary Cartesian representation, a point at infinity would have to be written as \((\infty,\infty)\) and thus loses its meaning. But with homogeneous coordinates we get a clean way to express any point at infinity, namely \((x,y,0)\). (Why? Converting it back to Cartesian gives \((\frac x 0,\frac y 0)=(\infty,\infty)\).)
Now, as we learned in middle school, solving the system of two line equations gives their intersection. Take two parallel lines \(Ax+By+C=0\) and \(Ax+By+D=0\) and look for an intersection: \[\left\{\matrix{Ax+By+C=0 \\Ax+By+D=0}\right.\] In Cartesian coordinates, the only solution is \(C=D\), i.e. the two lines are actually the same line.
But if we rewrite it in homogeneous coordinates, we get \[\left\{\matrix{A\frac x w+B\frac y w+C=0\\ A\frac x w + B\frac y w +D=0}\right.\]
Multiplying through by \(w\): \[\left\{\matrix{Ax+By+Cw=0\\Ax+By+Dw=0}\right.\]
Subtracting the two equations gives \((C-D)w=0\); since \(C\neq D\), we must have \(w=0\), and the system reduces to \(Ax+By=0\), with solutions \((x,-\frac {A}Bx,0)\). What x and y are does not really matter; what matters is that \(w=0\), which means this is a point at infinity. In other words, the two parallel lines do intersect at infinity, and we can even compute the intersection point explicitly!
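A quick concrete example: take the parallel lines \(x+y+1=0\) and \(x+y+3=0\). In homogeneous form, \[\left\{\matrix{x+y+w=0\\ x+y+3w=0}\right.\] Subtracting gives \(2w=0\), so \(w=0\) and \(y=-x\); the intersection is \((1,-1,0)\) (up to scale), the point at infinity in the direction of the two lines.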
Reference:
http://www.songho.ca/math/homogeneous/homogeneous.html
https://zhuanlan.zhihu.com/p/373969867
Reference: the original NeRF paper; an online article
Given a set of captures of the scene from known viewpoints (the captured images plus each image's intrinsics and extrinsics), synthesize images from novel viewpoints.
NeRF wants to do exactly this without an intermediate 3D reconstruction step: from only the poses, intrinsics, and images, directly synthesize images from novel viewpoints. To this end NeRF introduces the concept of a radiance field, a very important concept in graphics; here we give the definition of the rendering equation:
So how are radiance and color related? In short, light is electromagnetic radiation, an oscillating electromagnetic field. Light has a wavelength and a frequency, with wavelength \(\times\) frequency = speed of light, and the color of light is determined by its frequency. Most of this radiation is invisible; the part the human eye can see is called the visible spectrum, and its frequencies are what we perceive as colors:
How is this implemented?
The MLP is designed to have two stages:
For each pixel, sample points along the camera ray through this pixel;
For each sampling point, compute local color and density;
Use volume rendering: an integral along the camera ray through the pixel, \[C(\bold r)=\int_{t_1}^{t_2} T(t)\cdot \sigma (\bold r(t))\cdot \bold c(\bold r(t),\bold d)\cdot dt \\T(t)=\exp (-\int_{t_1}^t \sigma(\bold r(u))\cdot du)\] gives the color C of the pixel.
This can be implemented by sampling approaches.
Now everything can be approximated: \[\hat C(\bold r)=\sum_{i=1}^N \alpha_iT_i\bold c_i \\T_i=\exp \Big(-\sum_{j=1}^{i-1}\sigma_j\delta_j\Big) \\\alpha_i=1-\exp(-\sigma_i\delta_i)\\\delta_i=\text{distance between sampling points } i \text{ and } i+1\]
\[L=\sum_{r\in R}\| \hat C(\bold r)-C_{gt}(\bold r)\|^2_2\]
Similar to the above formulas, expected depth can also be calculated, and can be used to regularize the depth smoothness.
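A minimal PyTorch sketch of the discrete volume rendering above (my own illustration, not the official NeRF code; sigma and rgb would come from the MLP):

```python
import torch

def volume_render(sigma, rgb, t_vals):
    """sigma: (N_rays, N_samples), rgb: (N_rays, N_samples, 3),
    t_vals: (N_rays, N_samples) sample depths along each ray."""
    delta = t_vals[:, 1:] - t_vals[:, :-1]                       # distances between samples
    delta = torch.cat([delta, torch.full_like(delta[:, :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * delta)                      # per-sample opacity
    # Accumulated transmittance T_i = prod_{j < i} (1 - alpha_j)
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                                      # w_i = alpha_i * T_i
    color = (weights[..., None] * rgb).sum(dim=1)                # expected color per ray
    depth = (weights * t_vals).sum(dim=1)                        # expected depth (for regularization)
    return color, depth, weights
```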
It is required to greatly improve the fine detail results.
There are many other positional encoding techniques, including trainable parametric, integral, and hierarchical variants.
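For reference, the standard sinusoidal (NeRF-style) positional encoding looks roughly like this sketch:

```python
import math
import torch

def positional_encoding(x, num_freqs=10):
    """Map each input coordinate to [sin(2^k * pi * x), cos(2^k * pi * x)], k = 0..num_freqs-1.
    x: (..., D) -> (..., D * 2 * num_freqs)."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype) * math.pi   # 2^k * pi
    xf = x[..., None] * freqs                                         # (..., D, num_freqs)
    enc = torch.cat([torch.sin(xf), torch.cos(xf)], dim=-1)           # (..., D, 2*num_freqs)
    return enc.flatten(start_dim=-2)
```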
An SDF (signed distance function) is a function from computer graphics that defines distance: it gives the distance from a point in space to an implicit surface, with the sign determined by whether the point lies inside or outside the surface.
Compared with classic 3D representations such as point clouds, voxels, and meshes, an SDF has a fixed mathematical form, focuses more on the surface of the object, and has a controllable computational cost.
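The simplest concrete example is the SDF of a sphere (using the common convention: negative inside, positive outside, zero exactly on the surface):

```python
import torch

def sdf_sphere(points, center, radius):
    """Signed distance from each point to a sphere: ||p - c|| - r."""
    return torch.linalg.norm(points - center, dim=-1) - radius
```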
\[\arg \min_x\|y-F(x)\|+\lambda P(x)\]
y is the multiview images, F is the forward mapping, and x is the desired 3D reconstruction.
If F is differentiable, then you can supervise x through it.
This is part of my journey of learning NeRF.
Mildenhall et al. introduced NeRF at ECCV 2020 in the now seminal Neural Radiance Field paper.
This is done by storing the density and radiance in a neural volumetric scene representation using MLPs and then rendering the volume to create new images.
GIRAFFE: Compositional Generative Neural Feature Fields
Many more excellent related works were published at CVPR 2021. For example, improving the generalization of the network:
NeRF for dynamic scenes:
Other innovations:
Zero-Shot Text-Guided Object Generation with Dream Fields
NeRF at NeurIPS 2022 - Mark Boss
The bigger picture to learn:
Neural fields are ready to be a prime representation that can be manipulated, similar to point clouds or meshes.
You can either edit the input coordinates, or edit the parameters \(\theta\).
On the other axis, you can edit through an explicit geometry, or through an implicit neural field.
The following examples fall into different quadrants.
You can represent each object using a separate neural field (in its local frame), and then compose them together in different ways.
If you want to manipulate not only spatially but also temporally, that is possible as well: add a time coordinate as an input of the neural field network and transform the time input.
You can also manipulate (especially human body) via skeleton.
Beyond humans, we can also first estimate the different moving parts of an object to form a skeleton-like structure, and then do the same.
Beyond rigid manipulation, we can also manipulate via a mesh, because we have plenty of manipulation tools for meshes. The deformation of the mesh can be re-mapped to a deformation of the input coordinates.
We use \(f_{i\rightarrow j}\) to edit \(r_{i\rightarrow j}\), mapping one ray into another.
We need to define the consistency here, so that the network can learn through forward and backward:
The knowledge is already in the network. So instead of editing the inputs, we can directly edit the network parameters for generating new things.
It decomposes the network into shape and color networks, and we can edit each independently.
Volume rendering can render fog; sphere rendering only renders the solid surface and needs ground-truth supervision(?). A neural renderer combines the two.
Design a neural network with constraints on its higher-order derivatives, and then directly use its derivatives.
For example, the Eikonal equation forces the norm of the network's gradient to be 1. Adding the eikonal loss then keeps the network a valid (approximate) SDF.
Generally, this class of problems is: the solution is constrained through its partial derivatives.
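A minimal sketch of such a constraint, using the eikonal loss as the example (sdf_net stands for any coordinate network; the loss pushes the gradient norm toward 1):

```python
import torch

def eikonal_loss(sdf_net, points):
    """Penalize deviation of the SDF gradient norm from 1 at the given points."""
    points = points.clone().requires_grad_(True)
    sdf = sdf_net(points)                                                # (N, 1) predicted signed distances
    grad = torch.autograd.grad(sdf.sum(), points, create_graph=True)[0]  # (N, 3) spatial gradients
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()
```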
\[\text{Reconstruction} \xrightarrow{\;F(\cdot)\;} \text{Sensor domain},\qquad F(\text{Reconstruction}) \approx \text{Sensor measurements}\]
Q&A:
You may choose a proper representation depending on your own application.
The raw input is too huge, and then you would need a huge neural network. So this grid interpolation acts like a "positional encoding" that lifts the low-dimensional coordinates into high-dimensional features.
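A minimal sketch of that idea: store a learnable feature grid and trilinearly interpolate it at the query coordinates (here via F.grid_sample; coordinates are assumed to be pre-normalized to [-1, 1], and grid_sample's axis-ordering conventions should be double-checked for real use):

```python
import torch
import torch.nn.functional as F

# Learnable dense feature grid: (1, C, D, H, W)
feature_grid = torch.nn.Parameter(torch.randn(1, 32, 64, 64, 64) * 0.01)

def interpolate_features(coords):
    """coords: (N, 3) in [-1, 1]. Returns (N, C) trilinearly interpolated features."""
    grid = coords.view(1, -1, 1, 1, 3)                              # (1, N, 1, 1, 3) sampling locations
    feats = F.grid_sample(feature_grid, grid, align_corners=True)   # (1, C, N, 1, 1)
    return feats.view(feature_grid.shape[1], -1).t()                # (N, C)
```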
NeRFusion CVPR22: online!
Cons:
Unstructured grids. Compared with point clouds, meshes have connectivity info.
Something like projecting a 3D grid onto an axis to get levels of planes.
Pros:
Cons:
This is not very wise in my opinion. It is just a temporary tradeoff given today's technologies, because everything will be 3D in the future.
Generate 2D images from different camera views (perhaps). The key point is the tri-plane representation of 3D features.
Pros:
\[[x,y,z]\text{ coordinates}\rightarrow \text{Hash function()} \rightarrow \text{Fixed size codebook}\] Pros:
Cons:
Instead of storing a feature for every grid point, store (an index into) a code in a codebook. The size of the codebook is fixed, so the overall size can be kept much smaller.
Cons:
A commonly used method in computer graphics.
Similar to NLP, they use positional encodings, like sinusoid functions. I also remember an encoding method that takes the scattering of light rays into consideration.
ReLU is not perfect for this task, because it cannot handle functions with constraints on higher-order derivatives.
SIREN is a replacement.
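For reference, the core of SIREN is just a linear layer followed by a sine activation with a frequency scale omega_0 (a minimal sketch; the initialization details from the SIREN paper are omitted):

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer with sine activation, the building block of SIREN."""
    def __init__(self, in_features, out_features, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))
```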
Learning NeRF: Reading list, learning references, and plans
Notes of the CVPR'22 Tutorial:
1. Introduction to NeRF: What is NeRF and its features
2. Techniques
2.3. Differentiable Forward Maps
2.4. Prior-based reconstruction of neural fields
2.5. Manipulate Neural Fields
3. Applications
TBC
Notes of paper reading:
Reading NeuMan: Neural Human Radiance Field from a Single Video
Sounds like a one-shot task: instead of fitting and optimizing a separate neural field for each scene, learn a prior distribution over neural fields. Then, given a specific scene, the model adapts the neural field in just one forward pass.
Convolutional Occupancy Networks
The encoder is a 2D CNN structure.
But when using an auto-decoder, backpropagating through the forward map (i.e., the neural renderer) gives the 3D structural information to the latent codes directly. \[\text{latent code }\hat z=\arg \min_z \|\text{Render}(\Phi)-\text{g.t.}\|\]
Instead of trying to build an encoder, sometimes just backpropagating through the forward map is enough.
Instead of learning a NeRF where a neural renderer aggregates all sampled points along a ray, you can learn a network that directly gives you the color of a ray. So you do not use a 3D coordinate as the query; instead, you use a ray.
But this does not work on complicated tasks yet.
I was quite confused when doing a homework on implementing Luong attention, because it says the decoder is an RNN which takes \(y_{t-1}\) and \(s_{t-1}\) as input and outputs \(s_t\), i.e., \(s_t = RNN(y_{t-1}, s_{t-1})\).
But the PyTorch implementation of an RNN is \(outputs, hidden\_last = RNN(inputs, hidden\_init)\), which takes in a whole sequence of elements, processes them sequentially, and outputs a sequence as well.
I was confused about what \(s_t\) is. Is it the \(outputs\), or the \(hidden\_states\)?
This is the very helpful picture:
The \(output\) here is the \(hidden\_states\) of the last layer across all elements in the sequence (time steps), while \(h_n, c_n = hidden\_last\) are the \(hidden\_states\) of the last time step across all layers.
The former is \(H\), the collection of hidden states, which can be used in subsequent computations such as attention scores; the latter is the hidden state that can be fed directly into the next iteration.
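A small PyTorch sanity check of those shapes (a minimal sketch with arbitrary sizes):

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=16, hidden_size=32, num_layers=2, batch_first=True)
inputs = torch.randn(4, 10, 16)                 # (batch, seq_len, input_size)

outputs, (h_n, c_n) = rnn(inputs)
print(outputs.shape)   # (4, 10, 32): last layer's hidden state at every time step -> use for attention
print(h_n.shape)       # (2, 4, 32): last time step's hidden state for every layer -> feed to next step
```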
Authors: Meng-Li Shih, Shih-Yang Su, Johannes Kopf, Jia-Bin Huang
Published at: CVPR 2020
Link: https://github.com/vt-vl-lab/3d-photo-inpainting
Previous methods for turning a photo into 3D fill the newly exposed regions after a viewpoint change with a very blurry background; this method mainly uses inpainting to improve the quality of the newly generated background.
The pipeline is quite clear:
The input is a single RGB-D image. D is depth; photos taken by multi-lens smartphones usually come with depth, and otherwise depth is predicted by another model, e.g. MegaDepth, MiDaS, or a Kinect depth sensor.
Convert the input image into a Layered Depth Image (LDI). Each pixel in the LDI stores color and depth, plus its neighboring pixels in the four directions (up, down, left, right). The same location can hold multiple overlapping pixels at different depths.
Image preprocessing: detect depth-discontinuity edges.
Sharpen the depth edges with a filter, clean up some disconnected edges, and finally split the edges into separate depth edges by connectivity (as in Fig. 2(f), where different colors indicate different depth edges).
For each depth edge, cut the LDI pixels apart, extend some pixels on the background layer, and synthesize content for the extended region.
Find a depth edge and cut the two layers of pixels apart.
For the background layer, use a flood-fill-like algorithm to iteratively select some known region as the context region and expand some unknown region as the synthesis region.
Use the known context region to generate the depth and color of the unknown synthesis region, with a learning-based inpainting method.
The key point of this method: before predicting color and depth, first predict the depth edges, and then feed this edge information in, which gives better color and depth predictions.
Finally, fuse the generated pixels back into the LDI.
Authors: Ailing Zeng, Lei Yang, Xuan Ju, Jiefeng Li, Jianyi Wang, Qiang Xu
Published at: ECCV 2022
Link: https://github.com/cure-lab/SmoothNet
If you were to tackle this task, how would you do it? How does the authors' method differ from what you would expect?
By duration, jitter is categorized into sudden jitter and long-term jitter. Existing methods do not handle long-term jitter well.
By magnitude, jitter can also be divided into small jitter and large jitter. Small jitter usually comes from unavoidable estimation error or annotation error; large jitter comes from poor image quality, rare poses, heavy occlusion, and so on.
Errors are categorized into jitter error, caused by jitter between adjacent frames, and bias error, the deviation between the estimated and the ground-truth results. Existing methods do not decouple these two kinds of error.
They propose a basic SmoothNet and the full (motion-aware) SmoothNet.
Basic SmoothNet uses a fully connected network (FCN) as the backbone. A sliding window of length T feeds in T frames at a time, each with C channels.
The weights \(w_t^l\) and bias \(b^l\) belong to the \(t\)-th frame (in layer \(l\)) and are shared across different channels.
The complete motion-aware SmoothNet adds two extra branches: velocity and acceleration.
Since acceleration is one way to measure jitter, making acceleration explicit in the model is a natural choice. Given the predicted poses \(\hat Y\), the velocity is the difference between two adjacent frames, \[\hat V_{i,t} = \hat Y_{i,t} − \hat Y_{i,t−1}\] and the acceleration is the difference between velocities: \[\hat A_{i,t} = \hat V_{i,t} − \hat V_{i,t−1}\]
There are two losses:
the error between the ground-truth pose and the estimated pose: \[L_{pose} = \frac{1}{T\times C} \sum_{t=0}^T \sum_{i=0}^C |\hat G_{i,t} − Y_{i,t}|,\]
the error between the ground-truth acceleration and the estimated acceleration: \[L_{acc} = \frac{1}{(T-2)\times C} \sum_{t=0}^T \sum_{i=0}^C |\hat G''_{i,t} − A_{i,t}|,\]
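A minimal sketch of the velocity/acceleration computation and the two losses (my own illustration based on the formulas above, not the official SmoothNet code; pred and gt are assumed to be (T, C) windows of the smoothed output and the ground truth):

```python
import torch

def finite_diff(x):
    """First-order difference along the time axis: x[t] - x[t-1]. x: (T, C) -> (T-1, C)."""
    return x[1:] - x[:-1]

def smoothnet_losses(pred, gt):
    """pred, gt: (T, C) pose windows. Returns (L_pose, L_acc) as in the formulas above."""
    l_pose = (pred - gt).abs().mean()            # mean absolute pose error
    pred_acc = finite_diff(finite_diff(pred))    # acceleration of the prediction, (T-2, C)
    gt_acc = finite_diff(finite_diff(gt))        # acceleration of the ground truth
    l_acc = (pred_acc - gt_acc).abs().mean()     # mean absolute acceleration error
    return l_pose, l_acc
```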