Releases: chonkie-ai/chonkie

v0.4.1

07 Jan 13:22
e9440ac

Highlights

  • Batch chunking now shows a progress bar when chunking many texts:
from chonkie import RecursiveChunker

chunker = RecursiveChunker()

# [...] stands in for your list of texts
chunks = chunker([...], show_progress_bar=True)  # the progress bar is enabled by default

# 🦛 choooooooooooooooooooonk 100% • 200/200 docs chunked [00:00<00:00, 229.65doc/s] 🌱

What's Changed

  • Add CONTRIBUTING.md, update issue templates, CI, Codecov and more... by @bhavnicksm in #119
  • [FEAT] Add TQDM to default installs + CONTRIBUTING.md + other minor updates by @bhavnicksm in #120
  • [fix] CI: reports were not being uploaded to Codecov by @bhavnicksm in #121
  • Update CONTRIBUTING.md with first issue hyperlink by @shreyashnigam in #122
  • [FIX] Support class methods as token_counter objects for CustomEmbeddings (#92) by @bhavnicksm in #127
  • [Fix] Add fix for #92: Support class.method as a Tokenizer for CustomEmbedding + minor changes by @bhavnicksm in #128
  • [FIX] #116: Incorrect start_index when chunk_overlap is not 0 by @Udayk02 in #126
  • [FIX] start_index incorrect when chunk_overlap is not 0 (#116) by @bhavnicksm in #132
  • [FIX] Remove tests for Py3.8 — Incompatible for support by @bhavnicksm in #134
  • [fix] High chunk_overlap causes last chunk to be entirely redundant by @bhavnicksm in #136
  • [FIX] Handle edge case for RecursiveChunker (#131) by @bhavnicksm in #137
  • [DOCS] Update readme intro to match docs. by @shreyashnigam in #135
  • [FEAT] Add TQDM progress bars for chunk_batch + Update README.md by @bhavnicksm in #138
  • Replace dead discord link with infinite lifetime by @shreyashnigam in #140
  • [FIX] Minor fixes + Stylistic enhancements for TQDM and Multiprocessing by @bhavnicksm in #141
  • [chore] Bump up the package version to v0.4.1 by @bhavnicksm in #143
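
The start_index and chunk_overlap fixes listed above (#116, #136) can be illustrated with a minimal sketch. This is not Chonkie's implementation: `chunk_words`, the `Chunk` tuple, and the whitespace-based word splitting are assumptions made for illustration. It shows how character offsets can stay correct when windows overlap, and why the loop should stop once a window reaches the last word.

```python
import re
from typing import List, NamedTuple


class Chunk(NamedTuple):
    text: str
    start_index: int  # character offset into the source text
    end_index: int    # exclusive character offset


def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> List[Chunk]:
    """Sliding-window chunking over whitespace-delimited words (illustrative only)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Record each word's character span so indices come from the text itself.
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    chunks: List[Chunk] = []
    step = chunk_size - chunk_overlap
    for i in range(0, len(spans), step):
        window = spans[i : i + chunk_size]
        start, end = window[0][0], window[-1][1]
        chunks.append(Chunk(text[start:end], start, end))
        # Stop once a window reaches the last word; a further window would
        # only repeat text already covered by the overlap.
        if i + chunk_size >= len(spans):
            break
    return chunks


text = "one two three four five six seven"
chunks = chunk_words(text, chunk_size=4, chunk_overlap=2)
```

Slicing the source with each chunk's indices (`text[c.start_index:c.end_index]`) should give back `c.text`, which is a handy check for this class of bug.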

Full Changelog: v0.4.0...v0.4.1

v0.4.0

29 Dec 00:19
6f2fe07

Highlights

Added the new RecursiveChunker that uses complex recursive rules to create structurally meaningful chunks, maintaining natural separations as much as possible. Try it out~

from chonkie import RecursiveChunker
chunker = RecursiveChunker(chunk_size=512)
chunks = chunker("Woah! Chonkie has its own recursive chunker now~ so cooool!")

What's Changed

  • Add initial support for Recursive Chunking (RecursiveChunker) by @bhavnicksm in #107
  • [FEAT] Add support for RecursiveChunking + minor fixes by @bhavnicksm in #108
  • [fix] Correct the start and end indices for TokenChunker in Batch mode (#84) by @bhavnicksm in #109
  • [fix] Correct the start and end indices for TokenChunker in Batch mode (#84) by @bhavnicksm in #110
  • [fix] #106: Missing last sentence in the SemanticChunker by @bhavnicksm in #112
  • [fix] Add fix for #106: Reconstruction tests for SemanticChunker failing, missing last sentence by @bhavnicksm in #113
  • [chore] Bump version to "v0.4.0" + minor change by @bhavnicksm in #114

Full Changelog: v0.3.0...v0.4.0

v0.3.0

23 Dec 18:44
345ca6c

Highlights

  • Added LateChunker support! You can use LateChunker in the following manner:
from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="jinaai/jina-embeddings-v3",
    mode="sentence", 
    trust_remote_code=True
)
  • Added Chonkie Discord to the repository~ Join now to connect with the community! Oh, btw, Chonkie is now on Twitter and Bluesky too!
  • Bunch of bug fixes to improve chunkers' stability...

What's Changed

  • [Fix] #37: Incorrect indexing when repetition is present in the text by @bhavnicksm in #87
  • [Fix] #88: SemanticChunker raises UnboundLocalError: local variable 'threshold' referenced before assignment by @arpesenti in #89
  • [Fix] WordChunker chunk_batch fail by @sky-2002 in #90
  • [FIX] MEGA Bug Fix PR: Fix WordChunker batching, Fix SentenceChunker token counts, Initialization + more by @bhavnicksm in #96
  • Add initial support for Late Chunking by @bhavnicksm in #97
  • [FEAT] Add LateChunker by @bhavnicksm in #98
  • [FIX] Update outdated package versions + set max limit to numpy to v2.2 (buggy) by @bhavnicksm in #99
  • Update version to 0.3.0 in pyproject.toml and __init__.py by @bhavnicksm in #100
  • [fix] Add LateChunker support to chunker and module exports by @bhavnicksm in #101
  • [fix] Docstrings in SemanticChunker should include **kwargs by @bhavnicksm in #102
  • [Minor] Add Discord badge to README for community engagement by @bhavnicksm in #103

Full Changelog: v0.2.2...v0.3.0

v0.2.2

06 Dec 22:56
475f08d

Highlights

  • Added Token Estimate Validate Loops (TEVL) inside the SentenceChunker, giving speedups of up to ~5x at times
  • Added an auto-thresholding mode for SemanticChunkers, removing the hard requirement for similarity_threshold. SemanticChunkers can now decide on their own threshold, based on the minimum and maximum
  • Added OverlapRefinery for adding overlap context to chunks. The chunk_overlap parameter will be deprecated in the future in favor of OverlapRefinery.

What's Changed

  • [Fix] AutoEmbeddings not loading all-minilm-l6-v2 but loads All-MiniLM-L6-V2 by @bhavnicksm in #57
  • [Fix] Add fix for #55 by @bhavnicksm in #58
  • [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by @bhavnicksm in #60
  • [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by @bhavnicksm in #62
  • [Update] Change default embedding model in SemanticChunkers by @bhavnicksm in #63
  • Add min_chunk_size to SDPMChunker + Lint codebase with ruff + minor changes by @bhavnicksm in #68
  • Added automated testing using Github Actions by @pratyushmittal in #66
  • Add support for automated testing with Github Actions by @bhavnicksm in #69
  • [Fix] Allow for functions as token_counters in BaseChunkers by @bhavnicksm in #70
  • Add TEVL to speed up sentence chunker by @bhavnicksm in #71
  • Add TEVL to speed-up sentence chunking by @bhavnicksm in #72
  • Update the docs path to docs.chonkie.ai by @bhavnicksm in #75
  • [FEAT] Add BaseRefinery and OverlapRefinery support by @bhavnicksm in #77
  • Add support for BaseRefinery and OverlapRefinery + minor changes by @bhavnicksm in #78
  • [FEAT] Add "auto" threshold configuration via Statistical analysis in SemanticChunker + minor fixes by @bhavnicksm in #79
  • [Fix] Unify dataclasses under a types.py for ease by @bhavnicksm in #80
  • Expose the separation delim for simple multilingual chunking by @bhavnicksm in #81
  • Bump version to v0.2.2 for release by @bhavnicksm in #82

Full Changelog: v0.2.1...v0.2.2

v0.2.1.post1

24 Nov 14:41
7b1e480

Highlights

This patch allows AutoEmbeddings to properly default to SentenceTransformerEmbeddings, which was being bypassed in the previous release.

Furthermore, because of reconstructable splitting, numerous smaller sentences were making it through to the SemanticChunker. To work around the issue, this fix introduces a min_chunk_size parameter, which sets the minimum number of tokens that a chunk must contain. This resolves the failures in the tests.
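
As a rough illustration of what a minimum chunk size does, here is a minimal sketch. It is not taken from Chonkie's code: `merge_short_sentences` is a hypothetical helper, and tokens are approximated by whitespace-separated words. Short sentences are merged with their neighbours until each group reaches the minimum, and the original text stays reconstructable by concatenation.

```python
from typing import List


def merge_short_sentences(sentences: List[str], min_chunk_size: int) -> List[str]:
    """Greedily merge adjacent sentences until each group holds at least
    min_chunk_size tokens (approximated here by whitespace-separated words)."""
    merged: List[str] = []
    buffer = ""
    for sentence in sentences:
        buffer += sentence
        if len(buffer.split()) >= min_chunk_size:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Leftover text is too short on its own: attach it to the previous group.
        if merged:
            merged[-1] += buffer
        else:
            merged.append(buffer)
    return merged


sentences = ["Hi. ", "Yes. ", "This one is long enough. ", "So is this one here. ", "Ok."]
groups = merge_short_sentences(sentences, min_chunk_size=4)
```

With these inputs the two one-word sentences at the start and the trailing "Ok." are absorbed into their neighbours, so nothing shorter than four words reaches the next stage.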

What's Changed

  • [Fix] AutoEmbeddings not loading all-minilm-l6-v2 but loads All-MiniLM-L6-V2 by @bhavnicksm in #57
  • [Fix] Add fix for #55 by @bhavnicksm in #58
  • [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by @bhavnicksm in #60
  • [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by @bhavnicksm in #62

Full Changelog: v0.2.1...v0.2.1.post1

v0.2.1

22 Nov 10:40
f5768e8

Breaking Changes

  • SemanticChunker no longer accepts SentenceTransformer models directly; instead, this release uses the SentenceTransformerEmbeddings class, which can take in a model directly. Future releases will add the functionality to auto-detect and create embeddings inside the AutoEmbeddings class.
  • From this release onwards, the optional semantic installation depends by default on Model2VecEmbeddings, and hence on the model2vec Python package, for its size and speed benefits. Model2Vec uses static embeddings, which are good enough for chunking while being 10x faster than standard Sentence Transformers and a 10x lighter dependency.
  • SemanticChunker and SDPMChunker now use the argument chunk_size instead of max_chunk_size for uniformity across the chunkers, but the internal representation remains the same.

What's Changed

  • [BUG] Fix the start_index and end_index to point to character indices, not token indices by @mrmps in #29
  • [DOCS] Fix typo for import tokenizer in quick start example by @jasonacox in #30
  • Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by @bhavnicksm in #32
  • Use __slots__ instead of slots=True for python3.9 support by @bhavnicksm in #34
  • Bump version to 0.2.0.post1 in pyproject.toml and __init__.py by @bhavnicksm in #35
  • [FEAT] Add SentenceTransformerEmbeddings, EmbeddingsRegistry and AutoEmbeddings provider support by @bhavnicksm in #44
  • Refactor BaseChunker, SemanticChunker and SDPMChunker to support BaseEmbeddings by @bhavnicksm in #45
  • Add initial OpenAIEmbeddings support to Chonkie ✨ by @bhavnicksm in #46
  • [DOCS] Add info about initial embeddings support and how to add custom embeddings by @bhavnicksm in #47
  • [FEAT] - Add model2vec embedding models by @sky-2002 in #41
  • [FEAT] Add support for Model2VecEmbeddings + Switch default embeddings to Model2VecEmbeddings by @bhavnicksm in #49
  • [fix] Reorganize optional dependencies in pyproject.toml: rename 'sem… by @bhavnicksm in #51
  • [Fix] Token counts from Tokenizers and Transformers adding special tokens by @bhavnicksm in #52
  • [Fix] Refactor WordChunker, SentenceChunker pre-chunk splitting for reconstruction tests + minor changes by @bhavnicksm in #53
  • [Refactor] Optimize similarity calculation by using np.divide for imp… by @bhavnicksm in #54

Full Changelog: v0.2.0...v0.2.1

v0.2.0.post1

18 Nov 09:32
6227b48

Highlights

This patch fixes Python 3.9 support for dataclass slots. Earlier we were using slots=True, which only works from Python 3.10 onwards. The patch declares __slots__ explicitly instead, which also works on Python 3.10+.
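
The difference can be sketched as follows. The `Chunk` class and its field names here are illustrative assumptions, not Chonkie's actual types. Declaring `__slots__` by hand on a dataclass works on Python 3.9, as long as the fields have no default values (a default would create a class attribute that conflicts with the slot of the same name).

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk type; the field names are illustrative only."""

    # Hand-written __slots__ works on Python 3.9, whereas
    # @dataclass(slots=True) requires Python 3.10 or newer.
    __slots__ = ["text", "start_index", "end_index", "token_count"]

    text: str
    start_index: int
    end_index: int
    token_count: int


chunk = Chunk(text="hello", start_index=0, end_index=5, token_count=1)
```

Instances of such a class have no per-instance `__dict__`, so assigning an attribute that is not listed in `__slots__` raises `AttributeError`.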

What's Changed

  • Use __slots__ instead of slots=True for python3.9 support by @bhavnicksm in #34
  • Bump version to 0.2.0.post1 in pyproject.toml and __init__.py by @bhavnicksm in #35

Full Changelog: v0.2.0.post1...v0.2.0.post2

v0.2.0

17 Nov 13:27
0dd5ecb

Breaking Changes

  • Semantic chunkers no longer take an additional tokenizer object and instead infer the tokenizer from the embedding_model passed. This ensures that the tokenizer and embedding_model token counts always match, which is a necessary condition for some of the optimizations on them.

What's Changed

  • Update Docs by @bhavnicksm in #14
  • Update acknowledgements in README.md for improved clarity and appreci… by @bhavnicksm in #15
  • Update README.md + fix DOCS.md typo by @bhavnicksm in #17
  • Remove Spacy dependency from Chonkie by @bhavnicksm in #20
  • Remove Spacy dependency from 'sentence' install + Add FAQ to DOCS.md by @bhavnicksm in #21
  • Update README.md + minor updates by @bhavnicksm in #22
  • fix: tokenizer mismatch for SemanticChunker + Add BaseEmbeddings by @bhavnicksm in #24
  • Update dependency version of SentenceTransformer to at least 2.3.0 by @bhavnicksm in #27
  • Add initial batching support via chunk_batch fn + update DOCS by @bhavnicksm in #28
  • [BUG] Fix the start_index and end_index to point to character indices, not token indices by @mrmps in #29
  • [DOCS] Fix typo for import tokenizer in quick start example by @jasonacox in #30
  • Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by @bhavnicksm in #32

Full Changelog: v0.1.2...v0.2.0.post1

v0.1.2

08 Nov 17:26
745e5e8

What's Changed

  • Make imports as a part of Chunker init instead of file imports to make Chonkie import faster by @bhavnicksm in #12
  • Run Black + Isort + beautify the code a bit by @bhavnicksm in #13
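
The import change in #12 follows a common pattern that can be sketched like this. `LazyChunker` is a hypothetical class, and `json` merely stands in for a heavy third-party dependency; none of this is taken from Chonkie's source.

```python
import importlib
from types import ModuleType


class LazyChunker:
    """Hypothetical chunker that loads its dependency on construction."""

    def __init__(self, dependency: str = "json") -> None:
        # Importing inside __init__ (not at the top of the file) keeps the
        # package import fast; the cost is paid only when a chunker is built.
        self.backend: ModuleType = importlib.import_module(dependency)


chunker = LazyChunker()  # "json" stands in for a heavy third-party library
```

The trade-off is that a missing dependency surfaces when the chunker is constructed rather than when the package is imported.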

Full Changelog: v0.1.1...v0.1.2

v0.1.1

07 Nov 19:06
3b1fa22

Full Changelog: v0.1.0...v0.1.1