NLP

Introducing the Hebrew Wikipedia Sentences Corpus

By Dr. Tom Ron · February 14, 2026 · 5 min read

Hebrew is one of those languages that remain underserved in NLP. While English datasets are abundant, researchers and practitioners working with Hebrew often face a frustrating gap: high-quality, large-scale sentence-level corpora are hard to come by. I decided to do something about it.

The Dataset

The Hebrew Wikipedia Sentences Corpus is a collection of 10,999,257 cleaned, deduplicated Hebrew sentences, extracted from 366,610 Hebrew Wikipedia articles. The entire dataset is 1.8 GB in Parquet format and released under CC BY-SA 3.0.

Each sentence comes with rich metadata:

  • Article context — the source article ID, title, and Wikipedia categories
  • Position tracking — where the sentence appeared within its article
  • Quality signals — word count and Hebrew character ratio

The sentences average 16.6 words, with a median Hebrew ratio of 1.0 (fully Hebrew text). Every sentence is between 5 and 50 words, ensuring a useful range for most NLP tasks.
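Both quality signals are cheap to recompute yourself. Here is a minimal sketch of what such filters might look like, assuming the ratio is computed over non-whitespace characters (the exact normalization used in the pipeline isn't specified):

```python
import re

# Unicode Hebrew block (letters, points, and punctuation).
HEBREW_RE = re.compile(r"[\u0590-\u05FF]")

def hebrew_ratio(sentence: str) -> float:
    """Fraction of non-whitespace characters that fall in the Hebrew block."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if HEBREW_RE.match(c)) / len(chars)

def passes_filters(sentence: str) -> bool:
    """Apply the corpus's length and Hebrew-ratio thresholds."""
    words = sentence.split()
    return 5 <= len(words) <= 50 and hebrew_ratio(sentence) >= 0.5
```

A fully Hebrew sentence of five or more words passes; a short or mostly-Latin string does not.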

How It Was Built

The pipeline has three stages:

  1. Crawl — All Hebrew Wikipedia articles were fetched via the MediaWiki API
  2. Extract — Wikitext was converted to plain text and split into sentences, with filtering for length (5–50 words), Hebrew character ratio (≥50%), and content quality
  3. Deduplicate — Exact duplicates were removed using SHA-256 hashing

This approach prioritizes clean, usable data over raw volume. Wikipedia's encyclopedic register means the text is well-structured and grammatically sound, though it skews formal.
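The deduplication stage can be sketched as a single hash-set pass, assuming exact-match dedup on the stripped sentence string:

```python
import hashlib

def dedupe(sentences):
    """Drop exact duplicates, keeping the first occurrence of each sentence."""
    seen = set()
    unique = []
    for s in sentences:
        digest = hashlib.sha256(s.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)
    return unique
```

Hashing instead of storing the sentences themselves keeps the memory footprint of the seen-set small and constant per sentence, which matters at the 11M-sentence scale.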

Use Cases

This corpus is designed to support a range of Hebrew NLP tasks:

  • Language model pretraining and fine-tuning — 11M sentences provide substantial training signal
  • Text classification — Article categories enable supervised and semi-supervised approaches
  • Sentence similarity and semantic search — Clean, deduplicated sentences make strong training pairs
  • Named Entity Recognition — Wikipedia text is rich with named entities
  • Benchmarking — A standardized corpus for comparing Hebrew NLP models

Getting Started

from datasets import load_dataset

# Download the full corpus (≈1.8 GB in Parquet) from the Hugging Face Hub
ds = load_dataset("tomron87/hebrew-wikipedia-sentences-corpus")
print(f"Total sentences: {len(ds['train']):,}")
print(ds["train"][0])
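Once loaded, the per-sentence metadata makes slicing the corpus cheap. A minimal sketch of that kind of filter in plain Python, with hypothetical rows standing in for the real data (the column names `word_count` and `hebrew_ratio` are assumptions based on the quality signals described above; check the dataset card for the actual schema):

```python
# Hypothetical rows mirroring the assumed corpus schema.
rows = [
    {"sentence": "משפט ארוך ומפורט בעברית", "word_count": 22, "hebrew_ratio": 1.0},
    {"sentence": "משפט קצר", "word_count": 7, "hebrew_ratio": 0.8},
]

# Keep only longer, fully Hebrew sentences.
long_hebrew = [
    r for r in rows
    if r["word_count"] >= 20 and r["hebrew_ratio"] == 1.0
]
```

The same predicate drops straight into `Dataset.filter` on the real corpus.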

Limitations Worth Noting

The corpus reflects Wikipedia's characteristics: formal register, encyclopedic tone, and uneven topic coverage driven by editor demographics. It does not include spoken Hebrew, social media language, or informal writing. It is a snapshot from February 2026 and will not update automatically.

For applications requiring colloquial Hebrew or domain-specific language (medical, legal, etc.), this corpus works best as a foundation, supplemented with targeted data from those domains.

Access the Dataset

The full dataset is available on Hugging Face: tomron87/hebrew-wikipedia-sentences-corpus

If you use this dataset in your research, please cite it:

@dataset{hebrew_wikipedia_sentences,
  title = {Hebrew Wikipedia Sentences},
  author = {Tom Ron},
  year = {2026},
  url = {https://huggingface.co/datasets/tomron87/hebrew-wikipedia-sentences-corpus},
  license = {CC BY-SA 3.0}
}

I'd love to hear what you build with it.