Releases: chonkie-ai/chonkie
v0.4.1
Highlights
- Batch chunking now shows a progress bar when you chunk many texts at once:

```python
from chonkie import RecursiveChunker

chunker = RecursiveChunker()
chunks = chunker([...], show_progress_bar=True)  # the progress bar is enabled by default
# 🦛 choooooooooooooooooooonk 100% • 200/200 docs chunked [00:00<00:00, 229.65doc/s] 🌱
```
What's Changed
- Add CONTRIBUTING.md, update issue templates, CI, Codecov and more... by @bhavnicksm in #119
- [FEAT] Add TQDM to default installs + CONTRIBUTING.md + other minor updates by @bhavnicksm in #120
- [fix] CI: reports were not being uploaded to Codecov by @bhavnicksm in #121
- Update CONTRIBUTING.md with first issue hyperlink by @shreyashnigam in #122
- [FIX] Support class methods as `token_counter` objects for `CustomEmbeddings` (#92) by @bhavnicksm in #127
- [Fix] Add fix for #92: Support `class.method` as a Tokenizer for `CustomEmbedding` + minor changes by @bhavnicksm in #128
- [FIX] #116: Incorrect `start_index` when `chunk_overlap` is not 0 by @Udayk02 in #126
- [FIX] `start_index` incorrect when `chunk_overlap` is not 0 (#116) by @bhavnicksm in #132
- [FIX] Remove tests for Py3.8 — Incompatible for support by @bhavnicksm in #134
- [fix] High `chunk_overlap` causes last chunk to be entirely redundant by @bhavnicksm in #136
- [FIX] Handle edge case for RecursiveChunker (#131) by @bhavnicksm in #137
- [DOCS] Update readme intro to match docs. by @shreyashnigam in #135
- [FEAT] Add TQDM progress bars for `chunk_batch` + Update README.md by @bhavnicksm in #138
- Replace dead discord link with infinite lifetime by @shreyashnigam in #140
- [FIX] Minor fixes + Stylistic enhancements for TQDM and Multiprocessing by @bhavnicksm in #141
- [chore] Bump up the package version to v0.4.1 by @bhavnicksm in #143
New Contributors
- @shreyashnigam made their first contribution in #122
- @Udayk02 made their first contribution in #126
Full Changelog: v0.4.0...v0.4.1
v0.4.0
Highlights
Added the new `RecursiveChunker`, which uses recursive rules to create structurally meaningful chunks, preserving natural separations as much as possible. Try it out~

```python
from chonkie import RecursiveChunker

chunker = RecursiveChunker(chunk_size=512)
chunks = chunker("Woah! Chonkie has its own recursive chunker now~ so cooool!")
```
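The rule-based idea can be pictured as trying coarser separators first and falling back to finer ones. Below is a toy, character-budget sketch of that idea (illustrative only, not Chonkie's actual implementation, which works with token counts):

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Toy sketch of recursive chunking with a character budget:
    cut at the coarsest natural separator that fits, recurse on the rest."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        # last occurrence of this separator within the budget
        cut = text.rfind(sep, 0, chunk_size)
        if cut > 0:
            head = text[: cut + len(sep)]  # keep the separator with its chunk
            return [head] + recursive_split(text[cut + len(sep):], chunk_size, separators)
    # no natural break available: fall back to a hard split
    return [text[:chunk_size]] + recursive_split(text[chunk_size:], chunk_size, separators)
```

Because each chunk is a plain slice of the input, concatenating the chunks reconstructs the original text exactly.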
What's Changed
- Add initial support for Recursive Chunking (`RecursiveChunker`) by @bhavnicksm in #107
- [FEAT] Add support for RecursiveChunking + minor fixes by @bhavnicksm in #108
- [fix] Correct the start and end indices for TokenChunker in Batch mode (#84) by @bhavnicksm in #109
- [fix] Correct the start and end indices for TokenChunker in Batch mode (#84) by @bhavnicksm in #110
- [fix] #106: Missing last sentence in the SemanticChunker by @bhavnicksm in #112
- [fix] Add fix for #106: Reconstruction tests for SemanticChunker failing, missing last sentence by @bhavnicksm in #113
- [chore] Bump version to "v0.4.0" + minor change by @bhavnicksm in #114
Full Changelog: v0.3.0...v0.4.0
v0.3.0
Highlights
- Added `LateChunker` support! You can use `LateChunker` in the following manner:

```python
from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="jinaai/jina-embeddings-v3",
    mode="sentence",
    trust_remote_code=True,
)
```
- Added Chonkie Discord to the repository~ Join now to connect with the community! Oh, btw, Chonkie is now on Twitter and Bluesky too!
- A bunch of bug fixes to improve the chunkers' stability...
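The core idea behind late chunking is to embed the whole document in a single pass and only then pool the token embeddings per chunk, so every chunk vector carries full-document context. A minimal sketch of that pooling step (illustrative only, using plain lists in place of real model outputs):

```python
def pool_chunk_embeddings(token_embeddings, chunk_spans):
    """Mean-pool token embeddings (one vector per token, produced by a
    single full-document encoder pass) over each chunk's token span."""
    pooled = []
    for start, end in chunk_spans:
        dim = len(token_embeddings[0])
        vec = [
            sum(tok[d] for tok in token_embeddings[start:end]) / (end - start)
            for d in range(dim)
        ]
        pooled.append(vec)
    return pooled
```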
What's Changed
- [Fix] #37: Incorrect indexing when repetition is present in the text by @bhavnicksm in #87
- [Fix] #88: SemanticChunker raises UnboundLocalError: local variable 'threshold' referenced before assignment by @arpesenti in #89
- [Fix] WordChunker chunk_batch fail by @sky-2002 in #90
- [FIX] MEGA Bug Fix PR: Fix WordChunker batching, Fix SentenceChunker token counts, Initialization + more by @bhavnicksm in #96
- Add initial support for Late Chunking by @bhavnicksm in #97
- [FEAT] Add LateChunker by @bhavnicksm in #98
- [FIX] Update outdated package versions + set max limit to numpy to v2.2 (buggy) by @bhavnicksm in #99
- Update version to 0.3.0 in pyproject.toml and init.py by @bhavnicksm in #100
- [fix] Add LateChunker support to chunker and module exports by @bhavnicksm in #101
- [fix] Docstrings in SemanticChunker should include **kwargs by @bhavnicksm in #102
- [Minor] Add Discord badge to README for community engagement by @bhavnicksm in #103
New Contributors
- @arpesenti made their first contribution in #89
Full Changelog: v0.2.2...v0.3.0
v0.2.2
Highlights
- Added Token Estimate-Validate Loops (TEVL) inside the SentenceChunker, for speedups of up to ~5x in some cases
- Added an `auto` threshold mode for SemanticChunkers, removing the hard requirement on `similarity_threshold`. SemanticChunkers can now decide on their own threshold, based on the minimum and maximum
- Added `OverlapRefinery` for adding overlap context to the chunks. The `chunk_overlap` parameter will be deprecated in the future in favor of `OverlapRefinery`.
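The estimate-validate idea is to guess a split point from an assumed average chars-per-token ratio and call the real (expensive) token counter only to validate and correct the guess. A toy sketch under those assumptions (not the SentenceChunker's actual code):

```python
def estimate_validate_chunks(text, chunk_size, count_tokens, chars_per_token=4.0):
    """Toy token estimate-validate loop: guess a character cut from an
    assumed chars-per-token ratio, then shrink until the real count fits."""
    chunks = []
    start = 0
    while start < len(text):
        # estimate: optimistic cut based on the average ratio
        end = min(len(text), start + int(chunk_size * chars_per_token))
        # validate: shrink by ~10% until the real token count fits the budget
        while end > start + 1 and count_tokens(text[start:end]) > chunk_size:
            end = start + (end - start) * 9 // 10
        chunks.append(text[start:end])
        start = end
    return chunks
```

The speedup comes from calling `count_tokens` only a handful of times per chunk instead of once per candidate boundary.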
What's Changed
- [Fix] AutoEmbeddings not loading `all-minilm-l6-v2` but loads `All-MiniLM-L6-V2` by @bhavnicksm in #57
- [Fix] Add fix for #55 by @bhavnicksm in #58
- [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by @bhavnicksm in #60
- [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by @bhavnicksm in #62
- [Update] Change default embedding model in SemanticChunkers by @bhavnicksm in #63
- Add `min_chunk_size` to SDPMChunker + Lint codebase with ruff + minor changes by @bhavnicksm in #68
- Added automated testing using Github Actions by @pratyushmittal in #66
- Add support for automated testing with Github Actions by @bhavnicksm in #69
- [Fix] Allow for functions as token_counters in BaseChunkers by @bhavnicksm in #70
- Add TEVL to speed up sentence chunker by @bhavnicksm in #71
- Add TEVL to speed-up sentence chunking by @bhavnicksm in #72
- Update the docs path to docs.chonkie.ai by @bhavnicksm in #75
- [FEAT] Add BaseRefinery and OverlapRefinery support by @bhavnicksm in #77
- Add support for BaseRefinery and OverlapRefinery + minor changes by @bhavnicksm in #78
- [FEAT] Add "auto" threshold configuration via Statistical analysis in SemanticChunker + minor fixes by @bhavnicksm in #79
- [Fix] Unify dataclasses under a types.py for ease by @bhavnicksm in #80
- Expose the separation delim for simple multilingual chunking by @bhavnicksm in #81
- Bump version to v0.2.2 for release by @bhavnicksm in #82
New Contributors
- @pratyushmittal made their first contribution in #66
Full Changelog: v0.2.1...v0.2.2
v0.2.1.post1
Highlights
This patch allows `AutoEmbeddings` to properly default to `SentenceTransformerEmbeddings`, which was being bypassed in the previous release.
Furthermore, because of reconstructable splitting, numerous smaller sentences were making it through to the SemanticChunker. To address the issue, this fix introduces a `min_chunk_size` parameter, which sets the minimum number of tokens a chunk must contain. This resolves the failing tests.
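The effect of a minimum chunk size can be shown with a small standalone sketch (a hypothetical helper, not Chonkie's internal code) that merges undersized sentence splits into their neighbor before they reach the chunker:

```python
def merge_short_sentences(sentences, count_tokens, min_chunk_size):
    """Merge each split into the previous one until it reaches
    min_chunk_size tokens, so tiny fragments never reach the chunker."""
    merged = []
    for sentence in sentences:
        if merged and count_tokens(merged[-1]) < min_chunk_size:
            merged[-1] += sentence  # previous piece is still too small
        else:
            merged.append(sentence)
    return merged
```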
What's Changed
- [Fix] AutoEmbeddings not loading `all-minilm-l6-v2` but loads `All-MiniLM-L6-V2` by @bhavnicksm in #57
- [Fix] Add fix for #55 by @bhavnicksm in #58
- [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by @bhavnicksm in #60
- [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by @bhavnicksm in #62
Full Changelog: v0.2.1...v0.2.1.post1
v0.2.1
Breaking Changes
- SemanticChunker no longer accepts SentenceTransformer models directly; instead, this release uses the `SentenceTransformerEmbeddings` class, which can take in a model directly. Future releases will add the functionality to auto-detect and create embeddings inside the `AutoEmbeddings` class.
- By default, the `semantic` optional install now depends on `Model2VecEmbeddings`, and hence on the `model2vec` Python package, from this release onwards, due to size and speed benefits. `Model2Vec` uses static embeddings, which are good enough for the task of chunking while being 10x faster than standard Sentence Transformers and a 10x lighter dependency.
- `SemanticChunker` and `SDPMChunker` now use the argument `chunk_size` instead of `max_chunk_size` for uniformity across the chunkers, but the internal representation remains the same.
What's Changed
- [BUG] Fix the start_index and end_index to point to character indices, not token indices by @mrmps in #29
- [DOCS] Fix typo for import tokenizer in quick start example by @jasonacox in #30
- Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by @bhavnicksm in #32
- Use `__slots__` instead of `slots=True` for python3.9 support by @bhavnicksm in #34
- Bump version to 0.2.0.post1 in pyproject.toml and init.py by @bhavnicksm in #35
- [FEAT] Add SentenceTransformerEmbeddings, EmbeddingsRegistry and AutoEmbeddings provider support by @bhavnicksm in #44
- Refactor BaseChunker, SemanticChunker and SDPMChunker to support BaseEmbeddings by @bhavnicksm in #45
- Add initial OpenAIEmbeddings support to Chonkie ✨ by @bhavnicksm in #46
- [DOCS] Add info about initial embeddings support and how to add custom embeddings by @bhavnicksm in #47
- [FEAT] - Add model2vec embedding models by @sky-2002 in #41
- [FEAT] Add support for Model2VecEmbeddings + Switch default embeddings to Model2VecEmbeddings by @bhavnicksm in #49
- [fix] Reorganize optional dependencies in pyproject.toml: rename 'sem… by @bhavnicksm in #51
- [Fix] Token counts from Tokenizers and Transformers adding special tokens by @bhavnicksm in #52
- [Fix] Refactor WordChunker, SentenceChunker pre-chunk splitting for reconstruction tests + minor changes by @bhavnicksm in #53
- [Refactor] Optimize similarity calculation by using np.divide for imp… by @bhavnicksm in #54
New Contributors
- @mrmps made their first contribution in #29
- @jasonacox made their first contribution in #30
- @sky-2002 made their first contribution in #41
Full Changelog: v0.2.0...v0.2.1
v0.2.0.post1
Highlights
This patch fixes support for Python 3.9 with dataclass slots. Earlier we were using `slots=True`, which only works from Python 3.10 onwards; the new approach also works on Python 3.10+.
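For reference, the 3.9-compatible pattern looks like this (a generic illustration, not Chonkie's exact dataclass):

```python
from dataclasses import dataclass

# dataclass(slots=True) exists only on Python 3.10+; on 3.9 the
# equivalent is to declare __slots__ by hand (and it works on 3.10+ too).
@dataclass
class Chunk:
    __slots__ = ("text", "start_index", "end_index")
    text: str
    start_index: int
    end_index: int

chunk = Chunk("hello", 0, 5)
# slotted instances have no per-instance __dict__, saving memory per chunk
```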
What's Changed
- Use `__slots__` instead of `slots=True` for python3.9 support by @bhavnicksm in #34
- Bump version to 0.2.0.post1 in pyproject.toml and init.py by @bhavnicksm in #35
Full Changelog: v0.2.0...v0.2.0.post1
v0.2.0
Breaking Changes
- Semantic Chunkers no longer take an additional `tokenizer` object; instead, they infer the tokenizer from the `embedding_model` passed in. This ensures that the `tokenizer` and `embedding_model` token counts always match, which is a necessary condition for some of the optimizations applied to them.
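The design can be sketched as follows, with a hypothetical embeddings class standing in for a real backend: the chunker derives its token counter from the embedding model itself rather than accepting a separate tokenizer.

```python
class WordCountEmbeddings:
    """Hypothetical embeddings backend, for illustration only."""

    def count_tokens(self, text):
        return len(text.split())


def token_counter_from(embedding_model):
    # Infer the counter from the embedding model, so the counts used
    # for chunk sizing always match what the model will actually see.
    return embedding_model.count_tokens


counter = token_counter_from(WordCountEmbeddings())
```

With a separate tokenizer, a chunk sized to fit one vocabulary could overflow the embedding model's; deriving the counter from the model removes that mismatch by construction.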
What's Changed
- Update Docs by @bhavnicksm in #14
- Update acknowledgements in README.md for improved clarity and appreci… by @bhavnicksm in #15
- Update README.md + fix DOCS.md typo by @bhavnicksm in #17
- Remove Spacy dependency from Chonkie by @bhavnicksm in #20
- Remove Spacy dependency from 'sentence' install + Add FAQ to DOCS.md by @bhavnicksm in #21
- Update README.md + minor updates by @bhavnicksm in #22
- fix: tokenizer mismatch for `SemanticChunker` + Add BaseEmbeddings by @bhavnicksm in #24
- Update dependency version of SentenceTransformer to at least 2.3.0 by @bhavnicksm in #27
- Add initial batching support via `chunk_batch` fn + update DOCS by @bhavnicksm in #28
- [BUG] Fix the start_index and end_index to point to character indices, not token indices by @mrmps in #29
- [DOCS] Fix typo for import tokenizer in quick start example by @jasonacox in #30
- Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by @bhavnicksm in #32
New Contributors
- @mrmps made their first contribution in #29
- @jasonacox made their first contribution in #30
Full Changelog: v0.1.2...v0.2.0
v0.1.2
What's Changed
- Make imports as a part of Chunker init instead of file imports to make Chonkie import faster by @bhavnicksm in #12
- Run Black + Isort + beautify the code a bit by @bhavnicksm in #13
Full Changelog: v0.1.1...v0.1.2
v0.1.1
What's Changed
- Update README.md by @bhavnicksm in #10
- Bump version to 0.1.1 in pyproject.toml and init.py by @bhavnicksm in #11
Full Changelog: v0.1.0...v0.1.1