Add support for BaseRefinery and OverlapRefinery + minor changes #78
Merged
I have added short documentation on how to set up a local environment for testing.

**Changes in pyproject.toml**
`pytest` was not picking up local changes and raised a `ModuleNotFoundError` for `chonkie`. Adding `pythonpath` fixed that.

**Automated Testing**
I have also added a GitHub Action that runs the tests automatically on each `git push`. I used `uv` because of its very fast dependency installation. I followed this guide for the GitHub Actions setup: https://docs.astral.sh/uv/guides/integration/github/
- Added ruff checks for import sorting and docstring formatting
- Fixed docstrings across chunker and embeddings modules to comply with standards
- Updated pyproject.toml to include ruff configuration
- Introduced a new Context class for managing contextual information during chunk refinement.
- Added a new Refinery module with BaseRefinery and OverlapRefinery classes to enhance chunk processing.
- Updated the __init__.py files to include new classes in the package exports.
- Modified the Chunk class to incorporate context attributes.
- Enhanced the pyproject.toml to include the new refinery package.
- Added tests for OverlapRefinery to ensure functionality and correctness.
…updates
- Added error handling for missing embedding model, prompting installation of the `semantic` extra.
- Updated similarity threshold assignment to use the instance variable consistently.
- Introduced a new test for SDPMChunker to validate functionality with percentile-based similarity, ensuring proper chunking behavior and attributes.
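The missing-dependency handling mentioned in this commit could look roughly like the sketch below. It is not the code from this PR: `require_optional_module` is a hypothetical helper, and the wording of the error message is an assumption. Only the idea (fail with a message that points at the `semantic` extra) comes from the commit description.

```python
import importlib
from types import ModuleType


def require_optional_module(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or raise an actionable ImportError.

    Hypothetical helper for illustration only; not part of chonkie's API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Point the user at the optional extra instead of a bare traceback.
        raise ImportError(
            f"Could not import '{module_name}'. "
            f'Install the optional extra with: pip install "chonkie[{extra}]"'
        ) from exc
```

Calling `require_optional_module("some_embedding_backend", "semantic")` either returns the imported module or raises an `ImportError` that names the extra to install; the module name here is a placeholder.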
[FEAT] Add BaseRefinery and OverlapRefinery support
- Updated the class docstring to better describe the purpose and functionality of the TokenProcessor.
- Simplified the __init__ method formatting for better readability.
- Enhanced error messages for unsupported tokenizer backends.
- Cleaned up whitespace and formatting throughout the file for consistency.
…d structure
- Cleaned up whitespace and formatting in the OverlapRefinery and BaseRefinery classes.
- Updated docstrings for clarity and consistency.
- Adjusted method signatures and internal logic for better readability.
- Ensured consistent use of commas and spacing in function definitions and calls.
- Added a new 'mode' parameter to the OverlapRefinery class to allow context addition to either the prefix or suffix of chunks.
- Implemented separate methods for handling prefix and suffix overlap calculations, including exact and approximate token-based methods.
- Updated the refine method to process chunks based on the selected mode, improving flexibility in context management.
- Enhanced docstrings for clarity and to reflect the new functionality.
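To make the prefix/suffix distinction concrete, here is a minimal sketch of mode-dependent overlap context. It is not the OverlapRefinery implementation: `SketchChunk`, `add_overlap_context`, and `context_size` are hypothetical names, tokens are approximated by whitespace-separated words, and the direction of each mode is an assumption (prefix mode borrows the tail of the previous chunk, suffix mode borrows the head of the next chunk).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchChunk:
    """Hypothetical stand-in for a chunk; not chonkie's Chunk class."""

    text: str
    context: Optional[str] = None


def add_overlap_context(
    chunks: List[SketchChunk], context_size: int, mode: str = "suffix"
) -> List[SketchChunk]:
    """Attach neighbouring words to each chunk as context (assumed semantics)."""
    if mode not in ("prefix", "suffix"):
        raise ValueError("mode must be 'prefix' or 'suffix'")
    if context_size <= 0:
        raise ValueError("context_size must be positive")

    last = len(chunks) - 1
    for i, chunk in enumerate(chunks):
        if mode == "prefix":
            # Assumption: context comes from the end of the previous chunk,
            # so the first chunk has no context.
            if i == 0:
                chunk.context = None
            else:
                words = chunks[i - 1].text.split()
                chunk.context = " ".join(words[-context_size:])
        else:
            # Assumption: context comes from the start of the next chunk,
            # so the last chunk has no context.
            if i == last:
                chunk.context = None
            else:
                words = chunks[i + 1].text.split()
                chunk.context = " ".join(words[:context_size])
    return chunks
```

Under these assumptions, the chunk at the boundary of the sequence (first in prefix mode, last in suffix mode) is the only one left without context.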
- Cleaned up whitespace and formatting across multiple files, including __init__.py, context.py, base.py, semantic.py, sentence.py, and overlap.py.
- Enhanced docstrings for clarity and consistency in the Context and Chunk classes.
- Adjusted method signatures and internal logic for better readability in the BaseChunker and SentenceChunker classes.
- Updated test files to improve formatting and ensure consistent style.
- Ensured consistent use of commas and spacing in function definitions and calls.
- Renamed method for obtaining overlap context from `_get_overlap_context` to `_get_prefix_overlap_context` to clarify its purpose.
- Updated test assertions to reflect changes in context handling, ensuring the last chunk has no context and verifying context for all other chunks.
- Added new tests for prefix mode functionality in OverlapRefinery, ensuring correct context management and merging behavior.
- Adjusted existing tests to improve clarity and accuracy in context validation.
- Changed assertions in `test_overlap_refinery_sentence_chunks` to check context for the first chunk instead of the second.
- Modified the assertion in `test_overlap_refinery_prefix_mode_with_merge` to allow for equal token counts, ensuring clarity in validation messages.
This pull request includes several changes to the `chonkie` package, adding new functionality and improving existing code. The most important changes include the addition of a new `Context` class, updates to the `Chunk` class to include context, and various code simplifications and improvements.

New functionality:
- Added a new `Context` class for storing contextual information for chunk refinement. (`src/chonkie/context.py`)

Enhancements:
- Updated the `Chunk` class to include a `context` attribute and added several methods for better handling and representation of chunks. (`src/chonkie/chunker/base.py`)
- Added `BaseRefinery` and `OverlapRefinery` to `__init__.py` and its `__all__` list for better modularity. (`src/chonkie/__init__.py`)

Code simplifications:
- Simplified the `SemanticSentence` and `SemanticChunk` classes by removing the `__slots__` attribute and using `dataclass` fields. (`src/chonkie/chunker/semantic.py`)
- Simplified the `Sentence` and `SentenceChunk` classes by removing the `__slots__` attribute and using `dataclass` fields. (`src/chonkie/chunker/sentence.py`)

Configuration updates:
- Added `chonkie.refinery` to the `packages` list in `pyproject.toml`. (`pyproject.toml`)
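
As a rough illustration of a context object attached to a chunk through plain `dataclass` fields (rather than hand-written `__slots__`), consider the sketch below. The class names `SketchContext` and `ContextualChunk` and all field names are assumptions chosen for the example; the actual definitions live in `src/chonkie/context.py` and `src/chonkie/chunker/base.py` and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchContext:
    """Hypothetical context record; field names are assumptions."""

    text: str
    token_count: int


@dataclass
class ContextualChunk:
    """Hypothetical chunk with an optional context attribute."""

    text: str
    start_index: int
    end_index: int
    token_count: int
    # Left as None until a refinement step attaches context.
    context: Optional[SketchContext] = None


chunk = ContextualChunk(text="world", start_index=6, end_index=11, token_count=1)
chunk.context = SketchContext(text="hello", token_count=1)
```

Because these are ordinary dataclasses, `__init__`, `__repr__`, and `__eq__` are generated from the field declarations, which is the kind of boilerplate reduction the simplification bullets describe.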