The first open-source framework for holistic, structured repository-level documentation across multilingual codebases
Developers spend nearly 58% of their time understanding codebases, yet maintaining comprehensive documentation remains challenging. While recent Large Language Models (LLMs) show promise for function-level documentation, they fail at the repository level, where capturing architectural patterns and cross-module interactions is essential.
CodeWiki is the first open-source framework for holistic repository-level documentation across seven programming languages, introducing innovations in hierarchical decomposition, recursive agentic processing, and multi-modal synthesis.
Figure 1: The CodeWiki framework operates in three main phases: (1) repository analysis and hierarchical decomposition, (2) recursive documentation generation with dynamic delegation, and (3) hierarchical assembly and synthesis.
Hierarchical decomposition: A dynamic programming-inspired strategy breaks complex repositories into manageable modules while preserving architectural coherence. It handles codebases from 86K to 1.4M lines of code.
Recursive agentic processing: A multi-agent architecture with dynamic delegation adapts processing to module complexity, maintaining quality at repository-level scope.
Multi-modal synthesis: The generated documentation combines textual descriptions with architecture diagrams, data flows, and sequence diagrams for holistic understanding.
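The decompose-then-delegate idea can be illustrated with a minimal Python sketch. This is not CodeWiki's implementation: the `Module` class, the `MAX_LOC_PER_AGENT` budget, and the `summarize` stub (standing in for an LLM call) are all hypothetical names chosen for illustration, and the sketch shows only the recursive split-and-synthesize shape, not the actual clustering strategy.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical budget: the largest module a single agent documents directly.
MAX_LOC_PER_AGENT = 1_000


@dataclass
class Module:
    """A node in the module tree: leaf code or a group of submodules."""
    name: str
    loc: int = 0  # lines of code held directly by this node
    children: list[Module] = field(default_factory=list)

    def total_loc(self) -> int:
        return self.loc + sum(child.total_loc() for child in self.children)


def summarize(name: str, parts: list[str]) -> str:
    """Stand-in for an LLM call; a real system would prompt a model here."""
    return f"{name}({', '.join(parts)})" if parts else name


def document(module: Module) -> str:
    """Document a module bottom-up, delegating oversized modules."""
    if module.total_loc() <= MAX_LOC_PER_AGENT or not module.children:
        # Small enough (or indivisible): one agent handles it directly.
        return summarize(module.name, [])
    # Too large: delegate each child recursively, then synthesize the results.
    child_docs = [document(child) for child in module.children]
    return summarize(module.name, child_docs)


repo = Module("repo", children=[
    Module("core", children=[Module("parser", loc=900), Module("runtime", loc=800)]),
    Module("cli", loc=300),
])
print(document(repo))  # repo(core(parser, runtime), cli)
```

Here `core` exceeds the budget and is delegated to its two submodules, while `cli` is small enough to be handled in a single step; the parent-level summary is then assembled from the child results.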
CodeWiki is evaluated on CodeWikiBench, the first benchmark designed specifically for assessing repository-level documentation quality, spanning 21 repositories.
| Language Category | CodeWiki (Sonnet-4) | DeepWiki | Improvement |
|---|---|---|---|
| High-Level (Python, JS, TS) | 79.14% | 68.67% | +10.47% |
| Managed (C#, Java) | 68.84% | 64.80% | +4.04% |
| Systems (C, C++) | 53.24% | 56.39% | -3.15% |
| Overall Average | 68.79% | 64.06% | +4.73% |
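The Improvement column is the absolute difference between the two scores in percentage points (CodeWiki minus DeepWiki), not a relative change. A short Python check using only the numbers from the table above:

```python
# Scores copied from the table above, in percent: (CodeWiki, DeepWiki).
scores = {
    "High-Level (Python, JS, TS)": (79.14, 68.67),
    "Managed (C#, Java)": (68.84, 64.80),
    "Systems (C, C++)": (53.24, 56.39),
    "Overall Average": (68.79, 64.06),
}


def improvement(codewiki: float, deepwiki: float) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(codewiki - deepwiki, 2)


improvements = {name: improvement(*pair) for name, pair in scores.items()}
for name, delta in improvements.items():
    print(f"{name}: {delta:+.2f} pp")
```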
| Repository | Language | LOC | CodeWiki | DeepWiki | Improvement |
|---|---|---|---|---|---|
| All-Hands-AI--OpenHands | Python | 229K | 82.45% | 73.04% | +9.41% |
| puppeteer--puppeteer | TypeScript | 136K | 83.00% | 64.46% | +18.54% |
| sveltejs--svelte | JavaScript | 125K | 71.96% | 68.51% | +3.45% |
| Unity-Technologies--ml-agents | C# | 86K | 79.78% | 74.80% | +4.98% |
| elastic--logstash | Java | 117K | 57.90% | 54.80% | +3.10% |
View comprehensive results for all 21 repositories in our paper.
Watch CodeWiki in action as it generates comprehensive documentation for a real repository:
CLI Usage Example: Generating documentation with CodeWiki
```bash
# Install from source
pip install git+https://github.com/FSoft-AI4Code/CodeWiki.git

# Verify installation
codewiki --version
```
1. Configure CodeWiki:
```bash
codewiki config set \
  --api-key YOUR_API_KEY \
  --base-url https://api.anthropic.com \
  --main-model claude-sonnet-4 \
  --cluster-model claude-sonnet-4

# Verify configuration
codewiki config show
codewiki config validate
```
2. Generate Documentation:
```bash
# Navigate to your project
cd /path/to/your/project

# Generate documentation (saved to ./docs/)
codewiki generate

# Generate with GitHub Pages HTML viewer
codewiki generate --github-pages

# Full-featured generation
codewiki generate --create-branch --github-pages --verbose
```
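For scripted or CI use, the two steps above can be driven from Python. The following is a minimal sketch that only assembles the CLI calls already shown; it does not use any CodeWiki Python API. The environment variable name `CODEWIKI_API_KEY` and the helper function names are assumptions made for this example.

```python
import os
import subprocess
from pathlib import Path


def config_command(api_key: str) -> list[str]:
    """Build the `codewiki config set` call shown above as an argument list."""
    return [
        "codewiki", "config", "set",
        "--api-key", api_key,
        "--base-url", "https://api.anthropic.com",
        "--main-model", "claude-sonnet-4",
        "--cluster-model", "claude-sonnet-4",
    ]


def generate_command(github_pages: bool = False, verbose: bool = False) -> list[str]:
    """Build a `codewiki generate` call using only flags from the examples above."""
    cmd = ["codewiki", "generate"]
    if github_pages:
        cmd.append("--github-pages")
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_pipeline(project_dir: Path, run=subprocess.run) -> None:
    """Configure CodeWiki, validate the configuration, then generate docs."""
    # CODEWIKI_API_KEY is a hypothetical variable name chosen for this sketch.
    api_key = os.environ["CODEWIKI_API_KEY"]
    run(config_command(api_key), check=True)
    run(["codewiki", "config", "validate"], check=True)
    # Generation runs inside the project directory, mirroring `cd` above.
    run(generate_command(github_pages=True), cwd=project_dir, check=True)
```

Calling `run_pipeline(Path("/path/to/your/project"))` with the variable set runs the same commands as the manual steps. Reading the key from the environment keeps it out of the script itself, and `check=True` stops the pipeline as soon as any command exits with an error.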
```
./docs/
├── overview.md              # Repository overview (start here!)
├── module1.md               # Module documentation
├── module2.md               # Additional modules...
├── module_tree.json         # Hierarchical module structure
├── first_module_tree.json   # Initial clustering result
├── metadata.json            # Generation metadata
└── index.html               # Interactive viewer (with --github-pages)
```
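After a run, a quick sanity check of the output directory can catch incomplete generations. The sketch below relies only on the file names in the listing above and makes no assumption about the internal schema of the JSON files; the `inspect_docs` helper and its report keys are hypothetical.

```python
import json
from pathlib import Path

# File names taken from the listing above; index.html is only produced
# with --github-pages, so it is treated as optional here.
REQUIRED = [
    "overview.md",
    "module_tree.json",
    "first_module_tree.json",
    "metadata.json",
]


def inspect_docs(docs_dir: Path) -> dict:
    """Report missing required files, module pages, and unparsable JSON."""
    missing = [name for name in REQUIRED if not (docs_dir / name).is_file()]
    module_pages = sorted(
        p.name for p in docs_dir.glob("*.md") if p.name != "overview.md"
    )
    invalid_json = []
    for path in docs_dir.glob("*.json"):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            invalid_json.append(path.name)
    return {
        "missing": missing,
        "module_pages": module_pages,
        "invalid_json": sorted(invalid_json),
        "has_viewer": (docs_dir / "index.html").is_file(),
    }
```

For example, `inspect_docs(Path("./docs"))` returns a dictionary whose `missing` list is empty when all expected files are present.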
If you use CodeWiki in your research, please cite our paper:
```bibtex
@misc{hoang2025codewikievaluatingaisability,
  title={CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases},
  author={Anh Nguyen Hoang and Minh Le-Anh and Bach Le and Nghi D. Q. Bui},
  year={2025},
  eprint={2510.24428},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2510.24428},
}
```