Feature: Deep Search PDF to MD file conversion #33

nerdalert · 2024-06-26T05:54:32Z

WIP/POC for PDF -> MD file conversion using Deep Search

Issue #2

While the real endpoint is being stood up, this PR takes the client-side TypeScript library and adds a Go gin server that mocks the file conversion API responses.

All of this code infers the API from the client TypeScript library in src/lib/api/deepsearch/index.ts, so it will certainly need some tuning once the endpoint is stood up. I had to make one adjustment to the getDocumentHashes code from the DS4SD library: removing the / before api/xxx in path: `api/cps/public/v2/project/${projKey}/data_indices/${indexKey}/documents/transactions/${transactionId}` to prevent a double slash and the resulting 404. Possibly a typo, TBD.
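The double-slash problem can also be avoided generically by normalizing the join between host and path. A minimal TypeScript sketch, where `joinUrl` and the sample values are hypothetical and not part of the DS4SD library:

```typescript
// Join a base URL and a path with exactly one slash between them.
// `joinUrl` is a hypothetical helper for illustration, not DS4SD library code.
function joinUrl(base: string, path: string): string {
  const trimmedBase = base.replace(/\/+$/, ""); // drop trailing slashes
  const trimmedPath = path.replace(/^\/+/, ""); // drop leading slashes
  return `${trimmedBase}/${trimmedPath}`;
}

// Sample values taken from the mock server URLs in this PR.
const sampleProjKey = "mockProject";
const sampleIndexKey = "mockIndex";
const sampleTransactionId = "mock-transaction-id";

const hashesUrl = joinUrl(
  "http://localhost:8080/",
  `/api/cps/public/v2/project/${sampleProjKey}/data_indices/${sampleIndexKey}/documents/transactions/${sampleTransactionId}`
);
console.log(hashesUrl);
```

With a helper like this, a leading slash in the path is harmless whether or not the host ends in a slash.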

The UI components call the mock Deep Search API for PDF to Markdown conversion. The following operations run through the Next.js server side:

1. The upload component on the client side sends the PDF to the Next.js server side.
2. A POST endpoint handles the PDF to Markdown conversion.
3. Authenticate the user with the username and API key.
4. Launch the conversion task and wait for its completion with the status/wait API call. The Go mock API server adds a 15-second delay, long enough for the client side to send a second GET v2/project/mockProject/celery_tasks/mock-task-id.
5. Retrieve the transaction ID from the completed task.
6. Fetch document hashes using the transaction ID.
7. Obtain document artifacts using the document hash.
8. Return the document artifacts as the API response to the Next.js client side for rendering the results.
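The server-side sequence above can be sketched in TypeScript roughly as follows. This is not the code in route.ts: `convertPdf`, `ConversionConfig`, and the `Http` type are hypothetical, and every response field name (`access_token`, `task_id`, `task_status`, `result.transaction_id`, `documents[0].document_hash`) is an assumption about the payloads, since the real endpoint is not available yet. The URL paths are the ones used by the curl commands in this PR.

```typescript
// Minimal fetch-like types so the flow can run against `fetch` or a test stub.
type HttpInit = { method: string; headers: Record<string, string>; body?: string };
type HttpResponse = { ok: boolean; status: number; json(): Promise<any> };
type Http = (url: string, init: HttpInit) => Promise<HttpResponse>;

interface ConversionConfig {
  host: string; // e.g. "http://localhost:8080" for the Go mock
  projKey: string;
  indexKey: string;
  username: string;
  apiKey: string;
  maxPolls: number; // upper bound on status/wait calls
}

async function convertPdf(http: Http, cfg: ConversionConfig, pdfDataUrl: string): Promise<unknown> {
  // Small wrapper: send a request, fail on non-2xx, parse the JSON body.
  const call = async (method: string, path: string, headers: Record<string, string>, body?: unknown) => {
    const init: HttpInit = { method, headers };
    if (body !== undefined) init.body = JSON.stringify(body);
    const res = await http(`${cfg.host}/${path}`, init);
    if (!res.ok) throw new Error(`${method} ${path} failed with status ${res.status}`);
    return res.json();
  };
  const v1 = `api/cps/public/v1/project/${cfg.projKey}`;
  const v2 = `api/cps/public/v2/project/${cfg.projKey}`;

  // Authenticate with the username and API key (HTTP Basic).
  const basic = btoa(`${cfg.username}:${cfg.apiKey}`);
  const auth = await call("POST", "api/cps/user/v1/user/token", { Authorization: `Basic ${basic}` });
  // ASSUMPTION: the token is returned as `access_token`.
  const headers = { Authorization: String(auth.access_token), "Content-Type": "application/json" };

  // Launch the conversion task. ASSUMPTION: the response carries `task_id`.
  const launched = await call("POST", `${v1}/data_indices/${cfg.indexKey}/actions/ccs_convert_upload`, headers, {
    file_url: [pdfDataUrl],
  });

  // Poll the status/wait call until the task completes or the budget runs out.
  // ASSUMPTION: completion is signalled by `task_status === "SUCCESS"`.
  let task: any = null;
  for (let attempt = 0; attempt < cfg.maxPolls; attempt++) {
    task = await call("GET", `${v2}/celery_tasks/${launched.task_id}?wait=10`, headers);
    if (task.task_status === "SUCCESS") break;
  }
  if (!task || task.task_status !== "SUCCESS") throw new Error("conversion task did not complete");

  // Transaction ID -> document hashes -> artifacts (field names are assumptions).
  const transactionId = task.result.transaction_id;
  const hashes = await call("GET", `${v2}/data_indices/${cfg.indexKey}/documents/transactions/${transactionId}`, headers);
  const documentHash = hashes.documents[0].document_hash;
  return call("GET", `${v2}/data_indices/${cfg.indexKey}/documents/${documentHash}/artifacts`, headers);
}
```

Against the Go mock this would be called with something like `convertPdf(fetch, { host: "http://localhost:8080", projKey: "mockProject", indexKey: "mockIndex", username, apiKey, maxPolls: 3 }, dataUrl)`. Injecting the HTTP function keeps the flow testable without a running server.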

To start the Go mock Deep Search API server, run the following:

cd mock/go-deepseed/
go run deepseed-mock.go

Here are the curl commands for functional testing and for validating the API calls. They are also listed to clarify the operations in the src/app/api/conversion/route.ts code in this PR.

Authenticate

curl -X POST http://localhost:8080/api/cps/user/v1/user/token \
    -H "Content-Type: application/json" \
    -H "Authorization: Basic $(echo -n 'your-username:your-api-key' | base64)"
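For reference, the same Authorization header value can be built in TypeScript. This is a small sketch of standard HTTP Basic encoding; `basicAuthHeader` is a hypothetical helper name, and the credentials are the placeholders from the curl command.

```typescript
// Build the Authorization header value for HTTP Basic auth:
// "Basic " followed by base64("username:password").
// `basicAuthHeader` is a hypothetical helper for illustration.
function basicAuthHeader(username: string, apiKey: string): string {
  // btoa only accepts Latin-1 input, which is adequate for ASCII usernames and API keys.
  return `Basic ${btoa(`${username}:${apiKey}`)}`;
}

console.log(basicAuthHeader("your-username", "your-api-key"));
```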

POST a PDF to convert

curl -X POST http://localhost:8080/api/cps/public/v1/project/mockProject/data_indices/mockIndex/actions/ccs_convert_upload \
    -H "Authorization: mock-token" \
    -H "Content-Type: application/json" \
    -d '{
          "file_url": ["data:application/pdf;base64,<your-base64-pdf>"]
        }'
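The `file_url` entry is a base64 data URL wrapping the PDF bytes. A minimal TypeScript sketch of building that request body from raw bytes; `toPdfDataUrl` and `buildUploadBody` are hypothetical names, and the body shape is only what the curl command above shows:

```typescript
// Encode raw PDF bytes as a data URL of the form used in the curl example.
// Both helpers are hypothetical and for illustration only.
function toPdfDataUrl(bytes: Uint8Array): string {
  // Build a binary string one byte at a time; this avoids the argument-count
  // limits that String.fromCharCode(...bytes) can hit on large files.
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return `data:application/pdf;base64,${btoa(binary)}`;
}

// Serialize the JSON body expected by the ccs_convert_upload call in the mock.
function buildUploadBody(bytes: Uint8Array): string {
  return JSON.stringify({ file_url: [toPdfDataUrl(bytes)] });
}

// The first four bytes of any PDF file: "%PDF".
console.log(buildUploadBody(new Uint8Array([0x25, 0x50, 0x44, 0x46])));
```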

Wait for the task to complete

curl -X GET "http://localhost:8080/api/cps/public/v2/project/mockProject/celery_tasks/mock-task-id?wait=10" \
    -H "Authorization: mock-token"

Get Document Hashes

curl -X GET http://localhost:8080/api/cps/public/v2/project/mockProject/data_indices/mockIndex/documents/transactions/mock-transaction-id \
    -H "Authorization: mock-token"

Get Document Artifacts

curl -X GET http://localhost:8080/api/cps/public/v2/project/mockProject/data_indices/mockIndex/documents/mock-document-hash/artifacts \
    -H "Authorization: mock-token"

Here is a screencap of a basic upload. Once we get a feel for the conversion times, we can consider integrating this directly into the knowledge submission form: the form would convert the uploaded PDF to MD, then kick off the process to post it to a repo and supply the location and SHA of the docs.

conversion-mockup.mp4

nerdalert · 2024-07-01T03:32:26Z

Latest demo screen cap https://drive.google.com/file/d/1dal2nQHWr3ye1E_zCpIJzyqaOEy84Id5/view

Signed-off-by: Brent Salisbury <[email protected]>

nerdalert marked this pull request as draft June 26, 2024 05:55

nerdalert force-pushed the deepsearch-poc branch 2 times, most recently from 9f4de57 to b808390 Compare July 1, 2024 03:22

nerdalert force-pushed the deepsearch-poc branch 2 times, most recently from 6ee74c7 to 3ede608 Compare July 11, 2024 02:25

nerdalert changed the title ~~POC: Deep Search PDF to MD file conversion~~ Feature: Deep Search PDF to MD file conversion Jul 11, 2024

Deep Search PDF to MD file conversion

24f61d2

Signed-off-by: Brent Salisbury <[email protected]>

nerdalert force-pushed the deepsearch-poc branch from 3ede608 to 24f61d2 Compare July 12, 2024 18:43

vishnoianil added the Demo PR that contains Demo related changes label Aug 22, 2024
