Add support for BaseRefinery and OverlapRefinery + minor changes #78

Merged: 32 commits on Dec 5, 2024


bhavnicksm
Collaborator

This pull request adds new functionality to the chonkie package and cleans up existing code. The most important changes are a new Context class, a context attribute on the Chunk class, package-level exports for BaseRefinery and OverlapRefinery, and simpler dataclass definitions in the semantic and sentence chunkers.

New functionality:

  • Added Context class for storing contextual information for chunk refinement. (src/chonkie/context.py)

Enhancements:

  • Updated Chunk class to include a context attribute and added several methods for better handling and representation of chunks. (src/chonkie/chunker/base.py)
  • Exported BaseRefinery and OverlapRefinery from __init__.py and added them to the __all__ list. (src/chonkie/__init__.py)
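
To make the relationship between the two classes concrete, here is a minimal sketch of a chunk that carries an optional context. The field names (`text`, `token_count`, `start_index`, `end_index`) and the `__len__` method are assumptions for illustration only, not the definitions in src/chonkie/context.py or src/chonkie/chunker/base.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Context:
    """Hypothetical container for text borrowed from a neighbouring chunk."""

    text: str
    token_count: int


@dataclass
class Chunk:
    """Hypothetical chunk that a refinery can annotate with a context."""

    text: str
    start_index: int
    end_index: int
    token_count: int
    context: Optional[Context] = None  # stays None until a refinery sets it

    def __len__(self) -> int:
        # Character length of the chunk's own text; the context is not counted.
        return len(self.text)


chunk = Chunk(text="world again", start_index=6, end_index=17, token_count=2)
chunk.context = Context(text="hello", token_count=1)
```

Keeping the context in a separate optional attribute, rather than merging it into `text`, means the chunk's own indices and token count stay valid whether or not a refinery has run.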

Code simplifications:

  • Simplified the SemanticSentence and SemanticChunk classes by removing the __slots__ attribute and using dataclass fields. (src/chonkie/chunker/semantic.py)
  • Simplified the Sentence and SentenceChunk classes by removing the __slots__ attribute and using dataclass fields. (src/chonkie/chunker/sentence.py)
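
For readers unfamiliar with the trade-off, the sketch below contrasts a hand-written `__slots__` class with the dataclass form. The class bodies are illustrative assumptions (`SlotsSentence` is a made-up name and the fields are guesses), not the code in semantic.py or sentence.py.

```python
from dataclasses import dataclass, field
from typing import List


class SlotsSentence:
    """Manual style: __slots__ plus a hand-written __init__."""

    __slots__ = ["text", "start_index", "end_index", "token_count"]

    def __init__(self, text: str, start_index: int, end_index: int, token_count: int):
        self.text = text
        self.start_index = start_index
        self.end_index = end_index
        self.token_count = token_count


@dataclass
class Sentence:
    """Dataclass style: fields declared once; __init__, __repr__, __eq__ generated."""

    text: str
    start_index: int
    end_index: int
    token_count: int


@dataclass
class SentenceChunk:
    """A chunk that records the sentences it was built from."""

    text: str
    token_count: int
    # default_factory gives every instance its own list instead of a shared one.
    sentences: List[Sentence] = field(default_factory=list)
```

The dataclass version is shorter and compares by value, at the cost of the per-instance memory savings that `__slots__` provides.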

Configuration updates:

  • Added chonkie.refinery to the packages list in pyproject.toml. (pyproject.toml)

bhavnicksm and others added 30 commits November 25, 2024 16:11
I have added brief documentation on how to set up a local environment
for testing.

**Changes in pyproject.toml**
`pytest` was not picking up local changes and raised a
`ModuleNotFoundError` for `chonkie`. Adding `pythonpath` fixed
that.

**Automated Testing**
I have also added a GitHub Action that runs the tests automatically on
each `git push`. I used `uv` because it installs dependencies very
quickly.

I followed this guide for the GitHub Actions setup:
https://docs.astral.sh/uv/guides/integration/github/
- Added ruff checks for import sorting and docstring formatting
- Fixed docstrings across chunker and embeddings modules to comply with standards
- Updated pyproject.toml to include ruff configuration
- Introduced a new Context class for managing contextual information during chunk refinement.
- Added a new Refinery module with BaseRefinery and OverlapRefinery classes to enhance chunk processing.
- Updated the __init__.py files to include new classes in the package exports.
- Modified the Chunk class to incorporate context attributes.
- Enhanced the pyproject.toml to include the new refinery package.
- Added tests for OverlapRefinery to ensure functionality and correctness.
…updates

- Added error handling for missing embedding model, prompting installation of the `semantic` extra.
- Updated similarity threshold assignment to use the instance variable consistently.
- Introduced a new test for SDPMChunker to validate functionality with percentile-based similarity, ensuring proper chunking behavior and attributes.
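
The two ideas in this commit, guarding an optional dependency and deriving a threshold from a percentile, can be sketched as below. Both helpers (`require_module`, `percentile_threshold`) are hypothetical names, and the linear-interpolation percentile is an assumption about how a percentile-based similarity threshold might be computed, not the SDPMChunker implementation.

```python
import importlib.util
from typing import List


def require_module(module_name: str, extra: str) -> None:
    """Raise a helpful ImportError if a top-level optional module is missing."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is not installed. "
            f'Install it with: pip install "chonkie[{extra}]"'
        )


def percentile_threshold(similarities: List[float], percentile: float) -> float:
    """Return the given percentile (0-100) of the scores, linearly interpolated."""
    if not similarities:
        raise ValueError("need at least one similarity score")
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    ordered = sorted(similarities)
    rank = (percentile / 100) * (len(ordered) - 1)  # fractional index into ordered
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)
```

Checking for the module up front lets the error message name the extra to install, which is more actionable than a bare `ModuleNotFoundError` raised deep inside the chunker.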
[FEAT] Add BaseRefinery and OverlapRefinery support
- Updated the class docstring to better describe the purpose and functionality of the TokenProcessor.
- Simplified the __init__ method formatting for better readability.
- Enhanced error messages for unsupported tokenizer backends.
- Cleaned up whitespace and formatting throughout the file for consistency.
…d structure

- Cleaned up whitespace and formatting in the OverlapRefinery and BaseRefinery classes.
- Updated docstrings for clarity and consistency.
- Adjusted method signatures and internal logic for better readability.
- Ensured consistent use of commas and spacing in function definitions and calls.
- Added a new 'mode' parameter to the OverlapRefinery class to allow context addition to either the prefix or suffix of chunks.
- Implemented separate methods for handling prefix and suffix overlap calculations, including exact and approximate token-based methods.
- Updated the refine method to process chunks based on the selected mode, improving flexibility in context management.
- Enhanced docstrings for clarity and to reflect the new functionality.
- Cleaned up whitespace and formatting across multiple files, including __init__.py, context.py, base.py, semantic.py, sentence.py, and overlap.py.
- Enhanced docstrings for clarity and consistency in the Context and Chunk classes.
- Adjusted method signatures and internal logic for better readability in the BaseChunker and SentenceChunker classes.
- Updated test files to improve formatting and ensure consistent style.
- Ensured consistent use of commas and spacing in function definitions and calls.
- Renamed method for obtaining overlap context from `_get_overlap_context` to `_get_prefix_overlap_context` to clarify its purpose.
- Updated test assertions to reflect changes in context handling, ensuring the last chunk has no context and verifying context for all other chunks.
- Added new tests for prefix mode functionality in OverlapRefinery, ensuring correct context management and merging behavior.
- Adjusted existing tests to improve clarity and accuracy in context validation.
- Changed assertions in `test_overlap_refinery_sentence_chunks` to check context for the first chunk instead of the second.
- Modified the assertion in `test_overlap_refinery_prefix_mode_with_merge` to allow for equal token counts, ensuring clarity in validation messages.
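
A rough sketch of what a `mode` switch between prefix and suffix context could look like follows. Everything here is an assumption for illustration: the `refine` function, the `context_size` parameter, whitespace splitting as a stand-in for a real tokenizer, and the choice that prefix mode borrows from the end of the previous chunk while suffix mode borrows from the start of the next one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """Hypothetical minimal chunk: text plus an optional context string."""

    text: str
    context: Optional[str] = None


def refine(chunks: List[Chunk], context_size: int, mode: str = "suffix") -> List[Chunk]:
    """Attach up to `context_size` whitespace tokens from a neighbouring chunk."""
    if mode not in ("prefix", "suffix"):
        raise ValueError("mode must be 'prefix' or 'suffix'")
    if context_size <= 0:
        raise ValueError("context_size must be positive")
    for i, chunk in enumerate(chunks):
        if mode == "prefix" and i > 0:
            # Tail of the previous chunk; the first chunk gets no context.
            tokens = chunks[i - 1].text.split()
            chunk.context = " ".join(tokens[-context_size:])
        elif mode == "suffix" and i < len(chunks) - 1:
            # Head of the next chunk; the last chunk gets no context.
            tokens = chunks[i + 1].text.split()
            chunk.context = " ".join(tokens[:context_size])
    return chunks


suffix_chunks = refine([Chunk("a b c"), Chunk("d e f"), Chunk("g h i")], 2, "suffix")
prefix_chunks = refine([Chunk("a b c"), Chunk("d e f"), Chunk("g h i")], 2, "prefix")
```

Under these assumptions exactly one boundary chunk ends up without context in each mode, which is the kind of property the updated assertions check.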
@bhavnicksm bhavnicksm merged commit 27458b3 into main Dec 5, 2024
1 check passed
2 participants