NLP

Introducing the Hebrew Wikipedia Sentences Corpus

By Dr. Tom Ron · February 14, 2026 · 5 min read

Hebrew is one of those languages that remain underserved in NLP. While English datasets are abundant, researchers and practitioners working with Hebrew often face a frustrating gap: high-quality, large-scale sentence-level corpora are hard to come by. I decided to do something about it.

The Dataset

The Hebrew Wikipedia Sentences Corpus is a collection of 10,999,257 cleaned, deduplicated Hebrew sentences, extracted from 366,610 Hebrew Wikipedia articles. The entire dataset is 1.8 GB in Parquet format and released under CC BY-SA 3.0.

Each sentence comes with rich metadata:

  • Article context — the source article ID, title, and Wikipedia categories
  • Position tracking — where the sentence appeared within its article
  • Quality signals — word count and Hebrew character ratio

The sentences average 16.6 words, with a median Hebrew ratio of 1.0 (fully Hebrew text). Every sentence is between 5 and 50 words, ensuring a useful range for most NLP tasks.
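Both quality signals are cheap to recompute yourself. Here is a minimal sketch of what such filters might look like, assuming the ratio is computed over non-whitespace characters (the exact normalization used in the pipeline isn't specified):

```python
import re

# Unicode Hebrew block (letters, points, and punctuation).
HEBREW_RE = re.compile(r"[\u0590-\u05FF]")

def hebrew_ratio(sentence: str) -> float:
    """Fraction of non-whitespace characters that fall in the Hebrew block."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if HEBREW_RE.match(c)) / len(chars)

def passes_filters(sentence: str) -> bool:
    """Apply the corpus's length and Hebrew-ratio thresholds."""
    words = sentence.split()
    return 5 <= len(words) <= 50 and hebrew_ratio(sentence) >= 0.5
```

A fully Hebrew sentence of five or more words passes; a short or mostly-Latin string does not.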

How It Was Built

The pipeline has three stages:

  1. Crawl — All Hebrew Wikipedia articles were fetched via the MediaWiki API
  2. Extract — Wikitext was converted to plain text and split into sentences, with filtering for length (5–50 words), Hebrew character ratio (≥50%), and content quality
  3. Deduplicate — Exact duplicates were removed using SHA-256 hashing

This approach prioritizes clean, usable data over raw volume. Wikipedia's encyclopedic register means the text is well-structured and grammatically sound, though it skews formal.
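The deduplication stage can be sketched as a single hash-set pass, assuming exact-match dedup on the stripped sentence string:

```python
import hashlib

def dedupe(sentences):
    """Drop exact duplicates, keeping the first occurrence of each sentence."""
    seen = set()
    unique = []
    for s in sentences:
        digest = hashlib.sha256(s.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)
    return unique
```

Hashing instead of storing the sentences themselves keeps the memory footprint of the seen-set small and constant per sentence, which matters at the 11M-sentence scale.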

Use Cases

This corpus is designed to support a range of Hebrew NLP tasks:

  • Language model pretraining and fine-tuning — 11M sentences provide substantial training signal
  • Text classification — Article categories enable supervised and semi-supervised approaches
  • Sentence similarity and semantic search — Clean, deduplicated sentences make strong training pairs
  • Named Entity Recognition — Wikipedia text is rich with named entities
  • Benchmarking — A standardized corpus for comparing Hebrew NLP models

Getting Started

from datasets import load_dataset

# Download the full corpus (≈1.8 GB in Parquet) from the Hugging Face Hub
ds = load_dataset("tomron87/hebrew-wikipedia-sentences-corpus")
print(f"Total sentences: {len(ds['train']):,}")
print(ds["train"][0])
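Once loaded, the per-sentence metadata makes slicing the corpus cheap. A minimal sketch of that kind of filter in plain Python, with hypothetical rows standing in for the real data (the column names `word_count` and `hebrew_ratio` are assumptions based on the quality signals described above; check the dataset card for the actual schema):

```python
# Hypothetical rows mirroring the assumed corpus schema.
rows = [
    {"sentence": "משפט ארוך ומפורט בעברית", "word_count": 22, "hebrew_ratio": 1.0},
    {"sentence": "משפט קצר", "word_count": 7, "hebrew_ratio": 0.8},
]

# Keep only longer, fully Hebrew sentences.
long_hebrew = [
    r for r in rows
    if r["word_count"] >= 20 and r["hebrew_ratio"] == 1.0
]
```

The same predicate drops straight into `Dataset.filter` on the real corpus.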

Limitations Worth Noting

The corpus reflects Wikipedia's characteristics: formal register, encyclopedic tone, and uneven topic coverage driven by editor demographics. It does not include spoken Hebrew, social media language, or informal writing. It is a snapshot from February 2026 and will not update automatically.

For applications requiring colloquial Hebrew or domain-specific language (medical, legal, etc.), this corpus works best as a foundation, supplemented with targeted data from those domains.

Access the Dataset

The full dataset is available on Hugging Face: tomron87/hebrew-wikipedia-sentences-corpus

If you use this dataset in your research, please cite it:

@dataset{hebrew_wikipedia_sentences,
  title = {Hebrew Wikipedia Sentences},
  author = {Tom Ron},
  year = {2026},
  url = {https://huggingface.co/datasets/tomron87/hebrew-wikipedia-sentences-corpus},
  license = {CC BY-SA 3.0}
}

I'd love to hear what you build with it.