OneDrive Client with Two-way Synchronizing Feature

These details have not been verified by PyPI

Project links

Project description

Two-way OneDrive Synchronization Client

DISCLAIMER: This utility is in its early stage. Although it is designed to fail as soon as possible when conflicts are detected in order to avoid data loss, it is not guaranteed that your data is absolutely safe. Run it in a testing environment before you know what is going on

Based on Microsoft Graph, written in Python, this utility will detect changes you made locally and changes happened in the cloud, and try to merge them on both sides. If it fails due to conflicts, for example, you renamed a file locally but assigned the same file with another filename on your phone, this utility will complain and fail before making any operations, ensuring your data to remain intact.

To try it out, simply type this into your terminal (supposing you are using GNU/Linux and ~/.local/bin is in your $PATH)

pip install --user onedrive-sync-client # Install
# If you want to use the latest snapshot use the following line instead
# pip install --user https://github.com/gzxu/onedrive-sync-client/archive/master.zip
onedrive --set-location ~/OneDrive      # Config
onedrive                                # Sync!

Currently this utility utilizes extended attributes to save file identifiers locally, and modern filesystems and distributions should have this feature enabled by default. An SQLite database will be created and used as a central place to save essential information from the first time you run it, like your login token or the state of the whole filesystem tree at the last synchronization. We will ask you for the permissions to access your OneDrive files (of course), and to "access your information at any time", which means you do not need to login and authorize every time you use it.

Algorithm

An essential background is that OneDrive, like almost every cloud storage providers, assigns an globally-unique identifier to every file or directory (folder), because this identifier is also linked to other metadata like whether this file is allowed to be viewed by anonymous visitors, and the list of metadata is subject to grow at any time. This effectively reduces the complexity of our algorithm.

Firstly, the filesystem tree hierarchy will be constructed on both sides. OneDrive provided us with a simple API to dump the whole tree on the cloud, recursively, with the identifier and name of each item, and additionally the checksum of each file (Current implementation is different). The locally tree is easily constructed with aforementioned information, and the identifier information can be read from the extended attributes of each file or directory. You can inspect the locally stored identifier to FILENAME by getfattr -n user.onedrive.id FILENAME. Originally extended attributes are used because it moves with the corresponding file, but the default file manager, Nautilus, copies all extended attributes when copying files. This results in duplicated identifiers. Also, there will be no identifiers for new locally created files, so the locally constructed tree needs to assign a temporary identifier to each file, and maintain a mapping between the real identifiers and temporary identifiers. At the same time, the saved state of the tree at the last synchronization is also loaded.

Then there are three trees, the cloud tree, the local tree and the saved tree. We need to merge them in a two-way manner. The cloud tree is then compared with the saved tree. Although we call them "trees", they are actually lists of nodes sorted by their identifiers. Each node stores its name, checksum if applies, and the identifier of its parent. For each identifier, if the corresponding nodes are the same in the cloud tree and in the saved tree, it is considered as unchanged. Even if its parent is renamed or moved, as long as the identifier of its parent remain unchanged, this node will move along with its parent. If an identifier only exists in the cloud tree, it must be newly created; if it only exists in the base tree, it must be removed in the cloud since the last synchronization. If the identifier of its parent is different, it must be moved. If its name or checksum is changed, it must be renamed or overridden. Thence, we get a change set between the cloud tree and the saved tree. Similar approaches made for the local tree, but as there would be duplicated identifiers, for every duplication, the most similar one among them will be kept, and other ones will be treated as newly created. This is not optimal, because if we copied a folder locally, there will lots of upload traffic as we need to upload the whole folder.

Now we have two change sets, one is between the cloud tree and the saved tree, and the other is between the local tree and the saved tree. Each change set consists of several operations, and each operation is in one of the types listed below:

Create a file with a given name and a given checksum to a given directory
Delete a given file
Override the content of a given file with given checksum
Rename a given file with a given name and/or move it to a given parent directory
Create a directory with a given name to a given directory
Delete a given directory (This directory must be empty)
Rename a given directory with a given name and/or move it (along with all its children) to another given parent directory
Copy a file or directory (Currently unimplemented and omitted)

This set generated from the previous step is unordered, but they must be applied in an order. Some permutations of these steps are acceptable, but other ones causes conflicts. For example, you cannot create a file before the creation of its parent. Operations may conflict with operations in the same set, which represents the actual possible order of changes from the saved tree to the new trees; they may also conflict with operations in the other set, which may render the merging process unattainable.

Conflict checking and script sorting

There are two kinds of conflicts. One of them is conflicts between the two sets, that is, the same node cannot be both modified by the two sets. Another one is, actually more than conflicts, is sorting inside one change set and let it to be ordered. The order does not need to be considered between the two sets, because the cloud set must be applied after the local one, and actually it has been applied before the local one.

For each identifier, the corresponding node cannot be applied with conflicting operations. Conflicting operations may not happen inside one change set but only happen between the two sets, because they are generated from the same trees. Two operations with exactly the same parameters do not conflict, but if not, they conflict with each other if they have the same type of operation. Additionally, removing a node conflict with any other operation on that node. Adding node does not need to be considered here, because there will be no other operations on the same identifier.

The sorting problem is a topological sorting problem. Every operation is modeled as a vertex, and if an operation should happen after another one, there will be an oriented edge between them. Loops in this graph indicate conflict.

In order to find conflicts between different identifiers, the prerequisites and effects of each operation are marked and indexed. There are two types of prerequisites, and two corresponding types of effects: the given directory to exist and the given name inside the given directory is not occupied. An operation should be applied after these operations providing the corresponding effects to its prerequisites.

Create file
- Prerequisite: The destination must exist
- Prerequisite: The name in the destination must be available
Delete file
- Effect: Release that name
Override file
- No prerequisites nor effects
Rename and move file
- Prerequisite: The destination must exist
- Prerequisite: The name in the destination must be available
- Effect: Release the original name
Create directory
- Prerequisite: The destination must exist
- Prerequisite: The name in the destination must be available
- Effect: This directory begins to exist
Delete directory
- Prerequisite: All current names must be released
- Effect: Release that name
Rename and move directory
- Prerequisite: The destination must exist
- Prerequisite: The name in the destination must be available
- Effect: Release the original name

Actually there are cases that is not covered by this algorithm, for example, creating two files with different identifiers (of course) but under the same directory with the same name is not detected. Such conflicts happen between the two change sets, and is only relevant to name conflicts. There should be more unrevealed conflicts, and thence the generated script should be tested before applied.

Script checking and optimization

After topological sorting, the generated script is applied to a testing copy of the two trees, and if there is an operation to remove a directory, any operations to remove items from that directory can be marked as omitted, as an optimization.

Applying Script and Saving Trees

The REST API is described on the Microsoft Graph documentation site, and there are some points to be specifically noticed. The tree is saved to the database as tables of identifiers and properties.

Delta listing

OneDrive provides an API to dump the whole tree. This tree may be huge and the transfer process may be slow. To address this problem, the result of the last dump query can be saved, and use the delta API to generate the real cloud tree. When the delta API is not available, the fallback method can be used instead.

Batch requests

A request agent can be used to temporarily store requests and use the batch API to save traffic, before any force flushing. Also, some HTTP library supports HTTP 2.0 to additionally save traffic. However, this involves a more complex algorithm.

File downloading and uploading

As OneDrive supports partial downloading, a download manager is needed especially when the file is large. There are two upload APIs, and an upload manager is also needed. The /content API is easier to be used, and is suitable for files less than 4 MiB under good network condition. When the file is too large or when the network condition is bad, the /createUploadSession API have to be used. To upload file via this API, the parent identifier and the file name must be provided, and create an upload session with only the file identifier is unfeasible due to a bug of the server.

Known issues

Some applications, namely Gedit, will erase the saved attributes attached to the file, because it virtually creates a new file and removed the old one. As the id information is lost, we have to do the same thing (remove and re-upload) when synchronizing
The delta feature is temporarily disabled due to an unresolved bug in this utility

Future works

Solve the aforementioned issue by detecting missing identifiers and prompting the user
Solve the aforementioned bug and revise the algorithm to cover the aforementioned situation
Optimize deletion in the cloud by pruning
Transactional syncing to avoid unpredictable exceptions
Optimize by omitting changes that is the same in the two change sets
Add option to force override local tree with the cloud one or vice versa
Use st_mtime_ns to detect changes instead of checksums for faster local tree constructing and OneDrive for Business and SharePoint Server 2016 support. Checksums can be used as an auxiliary method to detect local changes, and eTags can be used to detect changes in the cloud
~~Use DAO when handling databases when possible~~
Use SQLAlchemy as the ORM engine
Enable Windows and/or macOS support (easy but probably unnecessary job)
Come up with a better model describing this problem and revise the algorithm based on this
Replenish the documentation in comments
Sweep bug out by introducing unit tests
Agent for batch requests, as mentioned above
Download and upload manager for unstable network connection
Download and upload manager with multi-threading support
Utilize the copy API, however as this an asynchronous one, parallel programming is a necessity
HTTP 2.0 support with libraries other than requests
Revise the commandline user interface by list out necessary information in a human-readable manner
Properly handle every possible exceptions
Introduce a logging framework
GUI with neat design
Port to C++ when possible

License

This project is licensed under the GNU Affero General Public License.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.2.4

Aug 16, 2018

0.2.3

Jul 2, 2018

0.2.2

Jun 18, 2018

0.2.1

Jun 17, 2018

0.2.0

Jun 17, 2018

0.1.4

Jun 7, 2018

0.1.3

Jun 6, 2018

0.1.2

Jun 2, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

onedrive-sync-client-0.2.4.tar.gz (22.7 kB view details)

Uploaded Aug 16, 2018 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

onedrive_sync_client-0.2.4-py3-none-any.whl (28.9 kB view details)

Uploaded Aug 16, 2018 Python 3

File details

Details for the file onedrive-sync-client-0.2.4.tar.gz.

File metadata

Download URL: onedrive-sync-client-0.2.4.tar.gz
Upload date: Aug 16, 2018
Size: 22.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/3.5.6

File hashes

Hashes for onedrive-sync-client-0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`b6aa5af2eb311c8de6bc517cf8bfe6adb66359c3d81a4a1c0e208df84569d356`
MD5	`1daa8e5f5f710c499980c8c327e63138`
BLAKE2b-256	`9c89d36e7ad064c0a83bfd3837e50c9001ba6529c237424f8c01b2c9a17124aa`

See more details on using hashes here.

File details

Details for the file onedrive_sync_client-0.2.4-py3-none-any.whl.

File metadata

Download URL: onedrive_sync_client-0.2.4-py3-none-any.whl
Upload date: Aug 16, 2018
Size: 28.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/3.5.6

File hashes

Hashes for onedrive_sync_client-0.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0d7e52677a6ddee6f19183ed3528e2a0314471a8fe15a1cb0fd44388173d7ef8`
MD5	`07415cac49275a8ca15245aa7c91e45b`
BLAKE2b-256	`2bfd0776beea7466cab1c3aede116df49758672af5e0a97bbd8fc91ca1451375`

See more details on using hashes here.

onedrive-sync-client 0.2.4

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Two-way OneDrive Synchronization Client

Algorithm

Conflict checking and script sorting

Script checking and optimization

Applying Script and Saving Trees

Delta listing

Batch requests

File downloading and uploading

Known issues

Future works

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes