The AI for Asian Studies project is a joint initiative by the Harvard Digital China Initiative and the China Biographical Database Project. The project aims to collect the most useful large language model projects for East Asian Studies. It is currently maintained by Kwok-leong Tang, Hongsu Wang, Wenxin Xiao, and Helen He. If you have any suggestions or comments, please don’t hesitate to contact the primary maintainer of this website, Hongsu Wang, at hongsuwang(at)fas.harvard.edu.

nvidia/NVIDIA-Nemotron-Parse-v1.1
https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1
NVIDIA Nemotron Parse v1.1 is designed to understand document semantics and extract text and table elements with spatial grounding. Given an image, it produces structured annotations, including formatted text, bounding boxes, and the corresponding semantic classes, ordered according to the document's reading flow. It overcomes the shortcomings of traditional OCR technologies, which struggle with complex, structurally variable document layouts, and helps transform unstructured documents into actionable, machine-usable representations. This has several downstream benefits, such as increasing the availability of training data for Large Language Models (LLMs), improving the accuracy of extraction, curation, retrieval, and agentic AI applications, and enhancing document understanding pipelines.
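
For readers who want to experiment, the sketch below loads the model through Hugging Face transformers' generic image-text-to-text interface. This is a minimal, hypothetical example: the concrete model class, task prompt, and output schema are specified on the model card, so treat it as a starting point rather than the exact API.

```python
# Hypothetical sketch: loading NVIDIA Nemotron Parse v1.1 through transformers'
# generic image-text-to-text interface. The actual model class, prompt format,
# and output schema are defined on the model card; adjust to match it.
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "nvidia/NVIDIA-Nemotron-Parse-v1.1"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(model_id, trust_remote_code=True)

# "page.png" is a placeholder, e.g. a scanned page from a digitized gazetteer.
image = Image.open("page.png")
inputs = processor(images=image, return_tensors="pt")

# The decoded string carries the structured annotations described above:
# formatted text with bounding boxes and semantic classes in reading order.
output_ids = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```
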
deepseek-ai/DeepSeek-OCR
https://huggingface.co/deepseek-ai/DeepSeek-OCR
https://arxiv.org/abs/2510.18234
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: a DeepEncoder and a DeepSeek3B-MoE-A570M decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, OCR accuracy remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens per page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day on a single A100-40G.
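
The model card documents a custom `infer` helper that is loaded via trust_remote_code. Below is a minimal sketch, assuming that interface, a CUDA GPU, and the card's prompt format; file and directory names are placeholders, and the keyword arguments should be verified against the current card, since remote-code interfaces can change.

```python
# A sketch of running DeepSeek-OCR through transformers' trust_remote_code path.
# The custom `infer` helper and the "<|grounding|>" prompt follow the model card
# at the time of writing; verify against the current card before relying on it.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# "<|grounding|>" requests layout-aware output; "scan.jpg" is a placeholder.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="scan.jpg",
    output_path="ocr_output/",
    save_results=True,
)
```

The sub-800-vision-token budget per page is what makes the 200k+ pages-per-day throughput quoted above attainable on a single A100-40G.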

Qwen/Qwen3-VL-8B-Instruct
https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct
Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date.
This generation delivers comprehensive upgrades: superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced comprehension of spatial relationships and video dynamics, and stronger agent interaction capabilities.
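
Like the two OCR-oriented models above, Qwen3-VL can be driven through Hugging Face transformers. A minimal sketch follows, assuming a recent transformers release with Qwen3-VL support and a processor chat template that accepts image entries directly; the file name and prompt are illustrative placeholders.

```python
# A minimal sketch of querying Qwen3-VL-8B-Instruct on a document image,
# assuming a recent transformers release with Qwen3-VL support; see the
# model card for the exact chat-template and vision-preprocessing details.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-8B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# "manuscript.png" and the task prompt are placeholders for illustration.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "manuscript.png"},
        {"type": "text", "text": "Transcribe the Chinese text on this page and translate it into English."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

# Decode only the newly generated tokens, not the echoed prompt.
output_ids = model.generate(**inputs, max_new_tokens=512)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```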