Releases: chonkie-ai/chonkie
v0.4.1
Highlights
- Batch chunking now shows a progress bar when you chunk many texts at once:

```python
from chonkie import RecursiveChunker

chunker = RecursiveChunker()
chunks = chunker([...], show_progress_bar=True)  # the progress bar is enabled by default
# 🦛 choooooooooooooooooooonk 100% • 200/200 docs chunked [00:00<00:00, 229.65doc/s] 🌱
```
What's Changed
- Add CONTRIBUTING.md, update issue templates, CI, Codecov and more... by @bhavnicksm in #119
- [FEAT] Add TQDM to default installs + CONTRIBUTING.md + other minor updates by @bhavnicksm in #120
- [fix] CI: reports were not being uploaded to Codecov by @bhavnicksm in #121
- Update CONTRIBUTING.md with first issue hyperlink by @shreyashnigam in #122
- [FIX] Support class methods as `token_counter` objects for `CustomEmbeddings` (#92) by @bhavnicksm in #127
- [Fix] Add fix for #92: Support `class.method` as a Tokenizer for `CustomEmbedding` + minor changes by @bhavnicksm in #128
- [FIX] #116: Incorrect `start_index` when `chunk_overlap` is not 0 by @Udayk02 in #126
- [FIX] `start_index` incorrect when `chunk_overlap` is not 0 (#116) by @bhavnicksm in #132
- [FIX] Remove tests for Py3.8 — Incompatible for support by @bhavnicksm in #134
- [fix] High `chunk_overlap` causes last chunk to be entirely redundant by @bhavnicksm in #136
- [FIX] Handle edge case for RecursiveChunker (#131) by @bhavnicksm in #137
- [DOCS] Update readme intro to match docs. by @shreyashnigam in #135
- [FEAT] Add TQDM progress bars for `chunk_batch` + Update README.md by @bhavnicksm in #138
- Replace dead discord link with infinite lifetime by @shreyashnigam in #140
- [FIX] Minor fixes + Stylistic enhancements for TQDM and Multiprocessing by @bhavnicksm in #141
- [chore] Bump up the package version to v0.4.1 by @bhavnicksm in #143
New Contributors
- @shreyashnigam made their first contribution in #122
- @Udayk02 made their first contribution in #126
Full Changelog: v0.4.0...v0.4.1
v0.4.0
Highlights
Added the new `RecursiveChunker`, which uses recursive rules to create structurally meaningful chunks, preserving natural separations as much as possible. Try it out~

```python
from chonkie import RecursiveChunker

chunker = RecursiveChunker(chunk_size=512)
chunks = chunker("Woah! Chonkie has its own recursive chunker now~ so cooool!")
```
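The rule-based idea can be pictured as trying coarser separators first and falling back to finer ones. Below is a toy, character-budget sketch of that idea (illustrative only, not Chonkie's actual implementation, which works with token counts):

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Toy sketch of recursive chunking with a character budget:
    cut at the coarsest natural separator that fits, recurse on the rest."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        # last occurrence of this separator within the budget
        cut = text.rfind(sep, 0, chunk_size)
        if cut > 0:
            head = text[: cut + len(sep)]  # keep the separator with its chunk
            return [head] + recursive_split(text[cut + len(sep):], chunk_size, separators)
    # no natural break available: fall back to a hard split
    return [text[:chunk_size]] + recursive_split(text[chunk_size:], chunk_size, separators)
```

Because each chunk is a plain slice of the input, concatenating the chunks reconstructs the original text exactly.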
What's Changed
- Add initial support for Recursive Chunking (`RecursiveChunker`) by @bhavnicksm in #107
- [FEAT] Add support for RecursiveChunking + minor fixes by @bhavnicksm in #108
- [fix] Correct the start and end indices for TokenChunker in Batch mode (#84) by @bhavnicksm in #109
- [fix] Correct the start and end indices for TokenChunker in Batch mode (#84) by @bhavnicksm in #110
- [fix] #106: Missing last sentence in the SemanticChunker by @bhavnicksm in #112
- [fix] Add fix for #106: Reconstruction tests for SemanticChunker failing, missing last sentence by @bhavnicksm in #113
- [chore] Bump version to "v0.4.0" + minor change by @bhavnicksm in #114
Full Changelog: v0.3.0...v0.4.0
v0.3.0
Highlights
- Added `LateChunker` support! You can use `LateChunker` in the following manner:

```python
from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="jinaai/jina-embeddings-v3",
    mode="sentence",
    trust_remote_code=True,
)
```
- Added Chonkie Discord to the repository~ Join now to connect with the community! Oh, btw, Chonkie is now on Twitter and Bluesky too!
- A bunch of bug fixes to improve the chunkers' stability...
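The core idea behind late chunking is to embed the whole document in a single pass and only then pool the token embeddings per chunk, so every chunk vector carries full-document context. A minimal sketch of that pooling step (illustrative only, using plain lists in place of real model outputs):

```python
def pool_chunk_embeddings(token_embeddings, chunk_spans):
    """Mean-pool token embeddings (one vector per token, produced by a
    single full-document encoder pass) over each chunk's token span."""
    pooled = []
    for start, end in chunk_spans:
        dim = len(token_embeddings[0])
        vec = [
            sum(tok[d] for tok in token_embeddings[start:end]) / (end - start)
            for d in range(dim)
        ]
        pooled.append(vec)
    return pooled
```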
What's Changed
- [Fix] #37: Incorrect indexing when repetition is present in the text by @bhavnicksm in #87
- [Fix] #88: SemanticChunker raises UnboundLocalError: local variable 'threshold' referenced before assignment by @arpesenti in #89
- [Fix] WordChunker chunk_batch fail by @sky-2002 in #90
- [FIX] MEGA Bug Fix PR: Fix WordChunker batching, Fix SentenceChunker token counts, Initialization + more by @bhavnicksm in #96
- Add initial support for Late Chunking by @bhavnicksm in #97
- [FEAT] Add LateChunker by @bhavnicksm in #98
- [FIX] Update outdated package versions + set max limit to numpy to v2.2 (buggy) by @bhavnicksm in #99
- Update version to 0.3.0 in pyproject.toml and init.py by @bhavnicksm in #100
- [fix] Add LateChunker support to chunker and module exports by @bhavnicksm in #101
- [fix] Docstrings in SemanticChunker should include **kwargs by @bhavnicksm in #102
- [Minor] Add Discord badge to README for community engagement by @bhavnicksm in #103
New Contributors
- @arpesenti made their first contribution in #89
Full Changelog: v0.2.2...v0.3.0
v0.2.2
Highlights
- Added Token Estimate-Validate Loops (TEVL) inside the SentenceChunker, for speedups of up to ~5x in some cases
- Added an `auto` threshold mode for SemanticChunkers, removing the hard requirement on `similarity_threshold`. SemanticChunkers can now decide on their own threshold, based on the minimum and maximum
- Added `OverlapRefinery` for adding overlap context to the chunks. The `chunk_overlap` parameter will be deprecated in the future in favor of `OverlapRefinery`.
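The estimate-validate idea is to guess a split point from an assumed average chars-per-token ratio and call the real (expensive) token counter only to validate and correct the guess. A toy sketch under those assumptions (not the SentenceChunker's actual code):

```python
def estimate_validate_chunks(text, chunk_size, count_tokens, chars_per_token=4.0):
    """Toy token estimate-validate loop: guess a character cut from an
    assumed chars-per-token ratio, then shrink until the real count fits."""
    chunks = []
    start = 0
    while start < len(text):
        # estimate: optimistic cut based on the average ratio
        end = min(len(text), start + int(chunk_size * chars_per_token))
        # validate: shrink by ~10% until the real token count fits the budget
        while end > start + 1 and count_tokens(text[start:end]) > chunk_size:
            end = start + (end - start) * 9 // 10
        chunks.append(text[start:end])
        start = end
    return chunks
```

The speedup comes from calling `count_tokens` only a handful of times per chunk instead of once per candidate boundary.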
What's Changed
- [Fix] AutoEmbeddings not loading `all-minilm-l6-v2` but loads `All-MiniLM-L6-V2` by @bhavnicksm in #57
- [Fix] Add fix for #55 by @bhavnicksm in #58
- [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by @bhavnicksm in #60
- [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by @bhavnicksm in #62
- [Update] Change default embedding model in SemanticChunkers by @bhavnicksm in #63
- Add `min_chunk_size` to SDPMChunker + Lint codebase with ruff + minor changes by @bhavnicksm in #68
- Added automated testing using Github Actions by @pratyushmittal in #66
- Add support for automated testing with Github Actions by @bhavnicksm in #69
- [Fix] Allow for functions as token_counters in BaseChunkers by @bhavnicksm in #70
- Add TEVL to speed up sentence chunker by @bhavnicksm in #71
- Add TEVL to speed-up sentence chunking by @bhavnicksm in #72
- Update the docs path to docs.chonkie.ai by @bhavnicksm in #75
- [FEAT] Add BaseRefinery and OverlapRefinery support by @bhavnicksm in #77
- Add support for BaseRefinery and OverlapRefinery + minor changes by @bhavnicksm in #78
- [FEAT] Add "auto" threshold configuration via Statistical analysis in SemanticChunker + minor fixes by @bhavnicksm in #79
- [Fix] Unify dataclasses under a types.py for ease by @bhavnicksm in #80
- Expose the separation delim for simple multilingual chunking by @bhavnicksm in #81
- Bump version to v0.2.2 for release by @bhavnicksm in #82
New Contributors
- @pratyushmittal made their first contribution in #66
Full Changelog: v0.2.1...v0.2.2
v0.2.1.post1
Highlights
This patch allows `AutoEmbeddings` to properly default to `SentenceTransformerEmbeddings`, which was being bypassed in the previous release.
Furthermore, because of reconstructable splitting, numerous smaller sentences were making it through to the SemanticChunker. To address the issue, this fix introduces a `min_chunk_size` parameter, which sets the minimum number of tokens a chunk must contain. This resolves the failing tests.
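The effect of a minimum chunk size can be shown with a small standalone sketch (a hypothetical helper, not Chonkie's internal code) that merges undersized sentence splits into their neighbor before they reach the chunker:

```python
def merge_short_sentences(sentences, count_tokens, min_chunk_size):
    """Merge each split into the previous one until it reaches
    min_chunk_size tokens, so tiny fragments never reach the chunker."""
    merged = []
    for sentence in sentences:
        if merged and count_tokens(merged[-1]) < min_chunk_size:
            merged[-1] += sentence  # previous piece is still too small
        else:
            merged.append(sentence)
    return merged
```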
What's Changed
- [Fix] AutoEmbeddings not loading `all-minilm-l6-v2` but loads `All-MiniLM-L6-V2` by @bhavnicksm in #57
- [Fix] Add fix for #55 by @bhavnicksm in #58
- [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by @bhavnicksm in #60
- [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by @bhavnicksm in #62
Full Changelog: v0.2.1...v0.2.1.post1
v0.2.1
Breaking Changes
- SemanticChunker no longer accepts SentenceTransformer models directly; instead, this release uses the `SentenceTransformerEmbeddings` class, which can take in a model directly. Future releases will add the functionality to auto-detect and create embeddings inside the `AutoEmbeddings` class.
- By default, the `semantic` optional install now depends on `Model2VecEmbeddings`, and hence on the `model2vec` Python package, from this release onwards, due to size and speed benefits. `Model2Vec` uses static embeddings, which are good enough for the task of chunking while being 10x faster than standard Sentence Transformers and a 10x lighter dependency.
- `SemanticChunker` and `SDPMChunker` now use the argument `chunk_size` instead of `max_chunk_size` for uniformity across the chunkers, but the internal representation remains the same.
What's Changed
- [BUG] Fix the start_index and end_index to point to character indices, not token indices by @mrmps in #29
- [DOCS] Fix typo for import tokenizer in quick start example by @jasonacox in #30
- Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by @bhavnicksm in #32
- Use `__slots__` instead of `slots=True` for python3.9 support by @bhavnicksm in #34
- Bump version to 0.2.0.post1 in pyproject.toml and init.py by @bhavnicksm in #35
- [FEAT] Add SentenceTransformerEmbeddings, EmbeddingsRegistry and AutoEmbeddings provider support by @bhavnicksm in #44
- Refactor BaseChunker, SemanticChunker and SDPMChunker to support BaseEmbeddings by @bhavnicksm in #45
- Add initial OpenAIEmbeddings support to Chonkie ✨ by @bhavnicksm in #46
- [DOCS] Add info about initial embeddings support and how to add custom embeddings by @bhavnicksm in #47
- [FEAT] - Add model2vec embedding models by @sky-2002 in #41
- [FEAT] Add support for Model2VecEmbeddings + Switch default embeddings to Model2VecEmbeddings by @bhavnicksm in #49
- [fix] Reorganize optional dependencies in pyproject.toml: rename 'sem… by @bhavnicksm in #51
- [Fix] Token counts from Tokenizers and Transformers adding special tokens by @bhavnicksm in #52
- [Fix] Refactor WordChunker, SentenceChunker pre-chunk splitting for reconstruction tests + minor changes by @bhavnicksm in #53
- [Refactor] Optimize similarity calculation by using np.divide for imp… by @bhavnicksm in #54
New Contributors
- @mrmps made their first contribution in #29
- @jasonacox made their first contribution in #30
- @sky-2002 made their first contribution in #41
Full Changelog: v0.2.0...v0.2.1
v0.2.0.post1
Highlights
This patch fixes support for Python 3.9 with dataclass slots. Earlier we were using `slots=True`, which only works from Python 3.10 onwards; the new approach also works on Python 3.10+.
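For reference, the 3.9-compatible pattern looks like this (a generic illustration, not Chonkie's exact dataclass):

```python
from dataclasses import dataclass

# dataclass(slots=True) exists only on Python 3.10+; on 3.9 the
# equivalent is to declare __slots__ by hand (and it works on 3.10+ too).
@dataclass
class Chunk:
    __slots__ = ("text", "start_index", "end_index")
    text: str
    start_index: int
    end_index: int

chunk = Chunk("hello", 0, 5)
# slotted instances have no per-instance __dict__, saving memory per chunk
```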
What's Changed
- Use `__slots__` instead of `slots=True` for python3.9 support by @bhavnicksm in #34
- Bump version to 0.2.0.post1 in pyproject.toml and init.py by @bhavnicksm in #35
Full Changelog: v0.2.0...v0.2.0.post1
v0.2.0
Breaking Changes
- Semantic Chunkers no longer take an additional `tokenizer` object; instead, they infer the tokenizer from the `embedding_model` passed in. This ensures that the `tokenizer` and `embedding_model` token counts always match, which is a necessary condition for some of the optimizations applied to them.
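The design can be sketched as follows, with a hypothetical embeddings class standing in for a real backend: the chunker derives its token counter from the embedding model itself rather than accepting a separate tokenizer.

```python
class WordCountEmbeddings:
    """Hypothetical embeddings backend, for illustration only."""

    def count_tokens(self, text):
        return len(text.split())


def token_counter_from(embedding_model):
    # Infer the counter from the embedding model, so the counts used
    # for chunk sizing always match what the model will actually see.
    return embedding_model.count_tokens


counter = token_counter_from(WordCountEmbeddings())
```

With a separate tokenizer, a chunk sized to fit one vocabulary could overflow the embedding model's; deriving the counter from the model removes that mismatch by construction.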
What's Changed
- Update Docs by @bhavnicksm in #14
- Update acknowledgements in README.md for improved clarity and appreci… by @bhavnicksm in #15
- Update README.md + fix DOCS.md typo by @bhavnicksm in #17
- Remove Spacy dependency from Chonkie by @bhavnicksm in #20
- Remove Spacy dependency from 'sentence' install + Add FAQ to DOCS.md by @bhavnicksm in #21
- Update README.md + minor updates by @bhavnicksm in #22
- fix: tokenizer mismatch for `SemanticChunker` + Add BaseEmbeddings by @bhavnicksm in #24
- Update dependency version of SentenceTransformer to at least 2.3.0 by @bhavnicksm in #27
- Add initial batching support via `chunk_batch` fn + update DOCS by @bhavnicksm in #28
- [BUG] Fix the start_index and end_index to point to character indices, not token indices by @mrmps in #29
- [DOCS] Fix typo for import tokenizer in quick start example by @jasonacox in #30
- Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by @bhavnicksm in #32
New Contributors
- @mrmps made their first contribution in #29
- @jasonacox made their first contribution in #30
Full Changelog: v0.1.2...v0.2.0
v0.1.2
What's Changed
- Make imports as a part of Chunker init instead of file imports to make Chonkie import faster by @bhavnicksm in #12
- Run Black + Isort + beautify the code a bit by @bhavnicksm in #13
Full Changelog: v0.1.1...v0.1.2
v0.1.1
What's Changed
- Update README.md by @bhavnicksm in #10
- Bump version to 0.1.1 in pyproject.toml and init.py by @bhavnicksm in #11
Full Changelog: v0.1.0...v0.1.1